Not all rows are read when importing a CSV into a pandas DataFrame

imb*_*a22 6 csv machine-learning python-3.x pandas kaggle

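I am trying a Kaggle challenge and, unfortunately, I am stuck at a very basic step. My limited Python knowledge is to blame. I am trying to read the dataset into a pandas DataFrame by running the following command: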

test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")

The problem is that this file has more than 300,000 records, but only 7945 rows (with 21 columns) are read.

print (test.shape)
(7945, 21)

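I have double-checked the file and cannot find anything special about line 7945. Any pointers as to why this might be happening? This seems like a very ordinary situation, and I hope someone who has run into this before can help.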

jez*_*ael 6

I think it is best to use the read_csv function with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False:

import pandas as pd
import csv

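# In pandas 1.3 and later, error_bad_lines=False is spelled on_bad_lines='skip'
# (error_bad_lines was removed in pandas 2.0).
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)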

print (test.shape)
#(381422, 22)

However, some of the data (the problematic lines) will be skipped.
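To see how these two parameters change the row count, here is a minimal sketch on a made-up in-memory sample (the `data` string is hypothetical and is not taken from emails.csv). It assumes pandas 1.3 or later, where `on_bad_lines='skip'` replaces `error_bad_lines=False`. With default quoting, a quoted field that spans several lines is read as one record; with `csv.QUOTE_NONE`, every physical line becomes its own row, and lines with too many fields are dropped.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: two records. The second has a quoted body that
# spans three physical lines, the last of which contains extra commas.
data = (
    'Id,Body\n'
    '1,"hello"\n'
    '2,"first line\n'
    'second line\n'
    'third, with, commas"\n'
)

# Default quoting: the multi-line quoted field stays inside one record.
default_df = pd.read_csv(io.StringIO(data))

# QUOTE_NONE: quotes are treated as ordinary characters, so each physical
# line is parsed as a row. The line with three fields is skipped.
raw_df = pd.read_csv(
    io.StringIO(data),
    quoting=csv.QUOTE_NONE,
    on_bad_lines="skip",
)

print(default_df.shape)  # (2, 2)
print(raw_df.shape)      # (3, 2)
```

In this sample the larger row count comes from splitting one record into several rows, so it may be worth checking whether the rows gained this way are whole records or fragments of one.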

If you want to skip the email body data, you can use:

import pandas as pd
import csv

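# header=None with names=... supplies the column names explicitly, so the
# file's own header line is read as an ordinary data row (removed further down).
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, sep=',', error_bad_lines=False, header=None,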
    names=["Id","DocNumber","MetadataSubject","MetadataTo","MetadataFrom","SenderPersonId","MetadataDateSent","MetadataDateReleased","MetadataPdfLink","MetadataCaseNumber","MetadataDocumentClass","ExtractedSubject","ExtractedTo","ExtractedFrom","ExtractedCc","ExtractedDateSent","ExtractedCaseNumber","ExtractedDocNumber","ExtractedDateReleased","ExtractedReleaseInPartOrFull","ExtractedBodyText","RawText"])

print (test.shape)

#delete row with NaN in column MetadataFrom
test = test.dropna(subset=['MetadataFrom'])
#delete headers in data
test = test[test.MetadataFrom != 'MetadataFrom']
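The two cleanup steps can be tried in isolation on a small frame. This is only a sketch: the three rows below are invented stand-ins for what the parsed file might contain (one real record, one header line read as data because of header=None, and one fragment with no sender), and only two of the 22 columns are shown.

```python
import pandas as pd

# Hypothetical rows, not real data from Emails.csv.
test = pd.DataFrame(
    {
        "Id": ["1", "Id", "stray body text"],
        "MetadataFrom": ["H", "MetadataFrom", None],
    }
)

# Drop rows that have no sender (here: the stray body-text fragment).
test = test.dropna(subset=["MetadataFrom"])

# Drop the header line that was read as a data row.
test = test[test.MetadataFrom != "MetadataFrom"]

print(test.shape)  # (1, 2)
```

Only the row that looks like a real record remains.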