Pandas equivalent of Python's readlines function

kil*_*les 2 python pandas

Using Python's readlines() function, I can retrieve a list of every line in a file:

with open('dat.csv', 'r') as dat:
    lines = dat.readlines()

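I am working on a problem that involves very large files, and this approach produces a memory error. Is there a pandas equivalent of Python's readlines() function? The chunksize option of pd.read_csv() seems to attach numbers to my lines, which is far from ideal.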

Minimal example:

In [1]: lines = []

In [2]: for df in pd.read_csv('s.csv', chunksize = 100):
   ...:     lines.append(df)
In [3]: lines
Out[3]: 
[   hello here is a line
 0  here is another line
 1  here is my last line]

In [4]: with open('s.csv', 'r') as dat:
   ...:     lines = dat.readlines()
   ...:     

In [5]: lines
Out[5]: ['hello here is a line\n', 'here is another line\n', 'here is my last line\n']

In [6]: cat s.csv
hello here is a line
here is another line
here is my last line

Tha*_*nos 8

You should try the chunksize option of pd.read_csv(), as mentioned in some of the comments.

This forces pd.read_csv() to read a fixed number of rows at a time instead of reading the whole file in one go. It looks like this:

>> df = pd.read_csv(filepath, chunksize=1, header=None, encoding='utf-8')

In the example above, the file is read one line at a time.

Note that, according to the documentation for pandas.read_csv, what is returned here is not a pandas.DataFrame but a TextFileReader object:

  • chunksize : int, default None

Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.
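To make the point about the return type concrete, here is a minimal sketch. It assumes an in-memory io.StringIO buffer as a stand-in for the file (the buffer and the variable names are illustrative, not part of the original answer) and shows that the object returned with chunksize set is an iterator whose items are small DataFrames:

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the three-line sample file.
sample = io.StringIO(
    "This is a new line\n"
    "This is another line of text\n"
    "And this is the last line of text in this file\n"
)

# With chunksize set, read_csv returns an iterator instead of a DataFrame.
reader = pd.read_csv(sample, header=None, chunksize=1)
is_frame = isinstance(reader, pd.DataFrame)

# Each item produced by the iterator is a DataFrame holding `chunksize` rows.
chunks = list(reader)
chunk_types = [type(chunk).__name__ for chunk in chunks]
chunk_shapes = [chunk.shape for chunk in chunks]

print(is_frame)
print(chunk_types)
print(chunk_shapes)
```

Because none of the sample lines contains a comma, each chunk has exactly one row and one column, which is why the loop in the answer can pick the text out with iloc[0,0].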

So, to complete the exercise, you need to put it in a loop like this:

In [385]: cat data_sample.tsv
This is a new line
This is another line of text
And this is the last line of text in this file

In [386]: lines = []

In [387]: for line in pd.read_csv('./data_sample.tsv', encoding='utf-8', header=None, chunksize=1):
   .....:     lines.append(line.iloc[0,0])
   .....:     

In [388]: print(lines)
['This is a new line', 'This is another line of text', 'And this is the last line of text in this file']
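As a variation on the loop above, a larger chunksize reads several rows per iteration rather than one. The sketch below is only an illustration under stated assumptions: iter_lines is a hypothetical helper name, the temporary file stands in for data_sample.tsv, and the lines are assumed to contain no commas or quote characters, since read_csv would otherwise split or reinterpret them. Blank lines are also skipped by read_csv by default, so this is not an exact replacement for readlines().

```python
import os
import tempfile

import pandas as pd


def iter_lines(path, chunksize=1000):
    """Yield the first field of every row, reading `chunksize` rows at a time.

    Hypothetical helper: assumes each line is a single comma-free field.
    """
    reader = pd.read_csv(path, header=None, encoding="utf-8", chunksize=chunksize)
    for chunk in reader:
        # Each chunk is a DataFrame; column 0 holds the text of each line.
        yield from chunk.iloc[:, 0].tolist()


# Write a small temporary file that mirrors the sample data in the answer.
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("This is a new line\n")
    tmp.write("This is another line of text\n")
    tmp.write("And this is the last line of text in this file\n")
    path = tmp.name

# chunksize=2 splits the three lines into one chunk of two rows and one of one row.
lines = list(iter_lines(path, chunksize=2))
os.remove(path)

print(lines)
```

Only one chunk is held in memory at a time while iterating; collecting everything into a list, as done here for display, gives that benefit up again for a very large file.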

I hope this helps!