Using Python's readlines() function, I can retrieve a list of every line in a file:
with open('dat.csv', 'r') as dat:
    lines = dat.readlines()
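I am working on a problem that involves very large files, and this approach raises a memory error. Is there a pandas equivalent of Python's readlines() function? The chunksize option of pd.read_csv() seems to add numbers to my lines, which is far from ideal.
Minimal example: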
In [1]: lines = []
In [2]: for df in pd.read_csv('s.csv', chunksize=100):
   ...:     lines.append(df)
In [3]: lines
Out[3]:
[ hello here is a line
0 here is another line
1 here is my last line]
In [4]: with open('s.csv', 'r') as dat:
   ...:     lines = dat.readlines()
   ...:
In [5]: lines
Out[5]: ['hello here is a line\n', 'here is another line\n', 'here is my last line\n']
In [6]: cat s.csv
hello here is a line
here is another line
here is my last line
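You should try the chunksize option of pd.read_csv(), as mentioned in some of the comments.
This forces pd.read_csv() to read a fixed number of rows at a time instead of loading the whole file at once. It looks like this: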
>> df = pd.read_csv(filepath, chunksize=1, header=None, encoding='utf-8')
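In the example above, the file is read one line at a time.
In fact, according to the documentation for pandas.read_csv, what is returned here is not a pandas.DataFrame but a TextFileReader object.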
- chunksize: int, default None
Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.
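A quick way to check this is sketched below. It uses an in-memory io.StringIO buffer in place of a real file, and the sample text and variable names are made up for illustration:

```python
import io

import pandas as pd

# Stand-in for a file on disk; the content is illustrative only.
text = "This is a new line\nThis is another line of text\n"

# With chunksize set, read_csv returns an iterator rather than a DataFrame.
reader = pd.read_csv(io.StringIO(text), header=None, chunksize=1)
is_dataframe = isinstance(reader, pd.DataFrame)

# Each item pulled from the iterator is a DataFrame holding one chunk of rows.
first_chunk = next(iter(reader))
first_line = first_chunk.iloc[0, 0]

print(type(reader).__name__, is_dataframe, first_chunk.shape, first_line)
```

Because header=None is passed, the first line is treated as data and not as column names, which is what went wrong in the question's example.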
So, to complete the exercise, you need to put it in a loop like this:
In [385]: cat data_sample.tsv
This is a new line
This is another line of text
And this is the last line of text in this file
In [386]: lines = []
In [387]: for line in pd.read_csv('./data_sample.tsv', encoding='utf-8', header=None, chunksize=1):
   .....:     lines.append(line.iloc[0, 0])
   .....:
In [388]: print(lines)
['This is a new line', 'This is another line of text', 'And this is the last line of text in this file']
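Reading one row per chunk can be slow on large files, and appending every line to a list still holds the whole file in memory. A possible variation is sketched below: it reads larger chunks and yields lines lazily. The helper name iter_lines and the default chunk size are made up for illustration, and the sketch assumes the lines contain no commas or quote characters, so that pandas parses each line as a single column. Blank lines are skipped by pandas by default.

```python
import os
import tempfile

import pandas as pd


def iter_lines(path, chunksize=10_000):
    """Yield each line of `path` as a string, parsing `chunksize` rows at a time.

    Assumption: no line contains a comma or quote character, so every
    line ends up in column 0 of the parsed chunk.
    """
    reader = pd.read_csv(
        path,
        header=None,            # treat the first line as data, not as a header
        chunksize=chunksize,    # parse this many rows per chunk
        dtype=str,              # keep numeric-looking lines as text
        keep_default_na=False,  # do not turn strings such as "NA" into NaN
        encoding="utf-8",
    )
    for chunk in reader:
        # Column 0 holds the full text of each line in this chunk.
        yield from chunk[0].tolist()


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, encoding="utf-8") as tmp:
    tmp.write("This is a new line\nThis is another line of text\n42\n")
    sample_path = tmp.name

sample_lines = list(iter_lines(sample_path, chunksize=2))
os.remove(sample_path)
print(sample_lines)
```

Looping over iter_lines(path) directly, instead of wrapping it in list(), keeps only one chunk in memory at a time.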
I hope this helps!