Using a small 10-line test file, I tried two approaches - parse the whole thing and select the last N rows, versus loading all the lines but parsing only the last N:
In [1025]: timeit np.genfromtxt('stack38704949.txt',delimiter=',')[-5:]
1000 loops, best of 3: 741 µs per loop
In [1026]: %%timeit
...: with open('stack38704949.txt','rb') as f:
...: lines = f.readlines()
...: np.genfromtxt(lines[-5:],delimiter=',')
1000 loops, best of 3: 378 µs per loop
This was flagged as a duplicate of "Efficiently read last 'n' rows of CSV into DataFrame". The accepted answer there used
from collections import deque
and collected the last N lines in that structure. It also used StringIO to feed the lines to the parser, which is an unnecessary complication. genfromtxt takes input from anything that feeds it lines, so a list of lines is fine.
In [1031]: %%timeit
...: with open('stack38704949.txt','rb') as f:
...: lines = deque(f,5)
...: np.genfromtxt(lines,delimiter=',')
1000 loops, best of 3: 382 µs per loop
Essentially the same time as readlines plus slicing.
deque may have an advantage when the file is very large and holding on to all the lines gets costly. I don't think it saves any file-reading time, though - the lines still have to be read one by one.
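To illustrate the memory point, here is a minimal sketch (the data is generated in memory with StringIO just for the demo; in practice it would be an open file object). With `maxlen=5`, the deque discards older lines as it iterates, so at most five lines are held at any moment, whereas `readlines` keeps the whole file in memory:

```python
from collections import deque
import io

# Simulate a large CSV; any iterable of lines (e.g. an open file) works the same way.
big_csv = io.StringIO("\n".join(f"{i},{i*2}" for i in range(100000)))

# deque with maxlen=5 keeps only the 5 most recent lines while iterating.
last5 = deque(big_csv, 5)
print(list(last5))  # the last five rows as raw text lines
```

The resulting deque can be passed straight to `np.genfromtxt(last5, delimiter=',')`, since it is just an iterable of lines.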
For comparison I timed the approach of getting a row_count first and then using skip_header; it is slower, since it requires reading the file twice. skip_header still has to read the lines.
In [1046]: %%timeit
...: with open('stack38704949.txt',"r") as f:
     ...: reader = csv.reader(f,delimiter = ",")
     ...: data = list(reader)
     ...: row_count = len(data)
...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')
The slowest run took 5.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 760 µs per loop
For the purpose of counting lines we don't need to use csv.reader, but it doesn't appear to cost much extra time either.
In [1048]: %%timeit
...: with open('stack38704949.txt',"r") as f:
...: lines=f.readlines()
     ...: row_count = len(lines)
...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')
1000 loops, best of 3: 736 µs per loop