How do I correctly read a CSV file when each line contains a different (and fairly large) number of fields?

use*_*223 10 python csv pandas

I have a text file from Amazon that contains the following information:

 #      user        item     time   rating     review text (the header is added by me for explanation; it is not in the text file)
  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use
  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant

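As you can see, the data is separated by spaces and each row has a different number of columns, because the last field is free text. Here is the code I tried: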

pd.read_csv(filename, sep = " ", header = None, names = ["user","item","time","rating", "review"], usecols = ["user", "item", "rating"])#I'd like to skip the text review part

It fails with this error:

ValueError: Passed header names mismatches usecols

When I try to read all of the columns instead:

pd.read_csv(filename, sep = " ", header = None)

this time the error is:

Error tokenizing data. C error: Expected 229 fields in line 3, saw 320

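Because the review text is very long in many rows, the approach of adding a header name for every column, as suggested in this question, does not work.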

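I would like to know how to read the CSV file in both cases: when I want to keep the review text and when I want to skip it. Thanks in advance!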

EDIT:

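Martin Evans solved the problem perfectly. But now I am working with another dataset that has a similar but different format. This time the order of the data is reversed: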

     # review text                          user        item     time   rating      (the header is added by me for explanation; it is not in the text file)
   I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    
  Suck restaurant                           hjf2329ccc    TGjsk123     14423321    3     

Do you have any idea how to read it correctly? Any help would be appreciated!

Mar*_*ans 13

As suggested, DictReader can also be used as follows to create a list of rows. This can then be imported as a frame in pandas:

import csv
import pandas as pd

rows = []
csv_header = ['user', 'item', 'time', 'rating', 'review']
frame_header = ['user', 'item', 'rating', 'review']

# newline='' is what the csv module expects for files opened in text mode
with open('input.csv', newline='') as f_input:
    # Any fields beyond the first four are collected as a list under the 'review' key
    reader = csv.DictReader(f_input, delimiter=' ', fieldnames=csv_header[:-1],
                            restkey=csv_header[-1], skipinitialspace=True)
    for row in reader:
        try:
            rows.append([row['user'], row['item'], row['rating'], ' '.join(row['review'])])
        except KeyError:
            # The row has no review text at all
            rows.append([row['user'], row['item'], row['rating'], ' '])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)

This would display the following:

         user      item rating                                  review
0  disjiad123  TYh23hs9      5  I love this phone as it is easy to use
1  hjf2329ccc  TGjsk123      3                         Suck restaurant
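
For this layout (four fixed fields followed by free text), a shorter route that avoids the csv module is to split each line at most four times, so that everything after the fourth field stays together as the review. The following is only a sketch: the helper name parse_fixed_then_text is made up for illustration, the third sample line is invented to show a row without a review, and the code assumes that none of the first four fields contains whitespace.

```python
def parse_fixed_then_text(line, n_fixed=4):
    """Split a line into n_fixed whitespace-separated fields plus trailing free text.

    Hypothetical helper for illustration; assumes the fixed fields contain no spaces.
    """
    # sep=None splits on runs of whitespace; maxsplit keeps the review in one piece.
    parts = line.split(None, n_fixed)
    fixed = parts[:n_fixed]
    # If the line has no review at all, fall back to an empty string.
    text = parts[n_fixed].rstrip() if len(parts) > n_fixed else ''
    return fixed + [text]


sample_lines = [
    "  disjiad123    TYh23hs9     13160032    5     I love this phone as it is easy to use\n",
    "  hjf2329ccc    TGjsk123     14423321    3     Suck restaurant\n",
    "  nouser0000    TGjsk123     14423321    4\n",   # made-up row without a review
]

rows = [parse_fixed_then_text(line) for line in sample_lines]
for row in rows:
    print(row)
```

The resulting rows list can be passed to pd.DataFrame(rows, columns=['user', 'item', 'time', 'rating', 'review']), and unwanted columns can be dropped afterwards.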

If the review appears at the start of the row, then one approach is to parse the line in reverse, as follows:

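import pandas as pd

rows = []
frame_header = ['rating', 'time', 'item', 'user', 'review']

with open('input.csv') as f_input:
    for row in f_input:
        # Reverse the line (without its line ending), split on spaces, drop the
        # empty strings produced by runs of spaces, and re-reverse each entry
        cols = [col[::-1] for col in row.rstrip('\r\n')[::-1].split(' ') if len(col)]
        # The first four entries are the fixed columns; the rest is the review
        rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)

This would display:

  rating      time      item        user  \
0      5  13160032  TYh23hs9   isjiad123   
1      3  14423321  TGjsk123  hjf2329ccc   

                                    review  
0  I love this phone as it is easy to used  
1                          Suck restaurant  

row.rstrip('\r\n') drops the line ending, which would otherwise sit at the start of the reversed line, and [::-1] then reverses the text of the whole line. The reversed line is split on spaces, and the list comprehension re-reverses each split entry while discarding the empty ones. Finally, rows is appended to by first taking the four fixed column entries (which are now at the start). The remaining entries are put back into their original order, joined together with a space, and added as the final column.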

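The benefit of this approach is that it does not rely on your input data being in an exactly fixed-width format, and you do not have to worry about whether the column widths change over time.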
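
The same idea of peeling the fixed fields off the end of the line can also be expressed without reversing any strings, because str.rsplit splits from the right. This is a sketch rather than a replacement for the code above: parse_text_then_fixed is a hypothetical helper name, the sample lines are copied from the question, and the code assumes that the four trailing fields never contain whitespace.

```python
def parse_text_then_fixed(line, n_fixed=4):
    """Split a line into leading free text plus n_fixed trailing fields.

    Hypothetical helper for illustration; assumes the trailing fields contain no spaces.
    """
    # With sep=None, rsplit splits on runs of whitespace, starting from the right.
    parts = line.rsplit(None, n_fixed)
    if len(parts) <= n_fixed:
        # No review text on this line: only the fixed fields (or fewer) were found.
        return [''] + parts
    # The leftover text keeps its leading padding, so strip it.
    return [parts[0].strip()] + parts[1:]


sample_lines = [
    "   I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    \n",
    "  Suck restaurant                           hjf2329ccc    TGjsk123     14423321    3     \n",
]

# Column order of each parsed row: review, user, item, time, rating
rows = [parse_text_then_fixed(line) for line in sample_lines]
for row in rows:
    print(row)
```

As with the reversed parsing, this depends only on whitespace between fields and not on the exact column positions.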


cht*_*mon 6

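This looks like a fixed-width file. pandas provides read_fwf for exactly this purpose. The following code reads the file correctly for me. You may need to adjust the widths if it does not work well for you.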

import pandas

pandas.read_fwf('test.fwf', 
                 widths=[13, 12, 13, 5, 100], 
                 names=['user', 'item', 'time', 'rating', 'review'])
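
To see what a widths list means in terms of character positions, it can help to turn it into half-open [start, end) intervals and slice a sample line by hand before calling read_fwf. The sketch below does this in plain Python. widths_to_colspecs and slice_line are hypothetical helper names, and the sample line is the first row from the question as it is displayed on this page, so its exact spacing is an assumption.

```python
from itertools import accumulate


def widths_to_colspecs(widths):
    """Convert consecutive column widths into half-open (start, end) intervals."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def slice_line(line, colspecs):
    """Cut one line at the given intervals and strip the padding from each piece."""
    return [line[start:end].strip() for start, end in colspecs]


widths = [13, 12, 13, 5, 100]
colspecs = widths_to_colspecs(widths)
print(colspecs)

# First sample row from the question, written in pieces so the padding is visible.
sample = (
    "  disjiad123"
    "    TYh23hs9"
    "     13160032"
    "    5"
    "     I love this phone as it is easy to use"
)
print(slice_line(sample, colspecs))
```

If a field comes out truncated or merged with its neighbour in this kind of check, the corresponding width is the one to adjust.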

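If the columns still line up in the edited version (where the review comes first), you just need to supply the right specification. A ruler like the one below helps to work this out quickly: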

0        1         2         3         4         5         6         7         8
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
  I love this phone as it is easy to used  isjiad123    TYh23hs9     13160032    5    
  Suck restaurant                          hjf2329ccc   TGjsk123     14423321    3     

So the new command becomes:

pandas.read_fwf('test.fwf', 
                colspecs=[[0, 43], [44, 56], [57, 69], [70, 79], [80, 84]], 
                names=['review', 'user', 'item', 'time', 'rating'])
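
Because these intervals leave one-character gaps between columns (position 43, for example, belongs to no column), a value that is longer than expected can silently lose characters. One possible way to check for this before trusting the result is to look for non-space characters that fall outside every interval. In the sketch below, uncovered_characters is a hypothetical helper, and both sample lines are constructed for illustration rather than taken from the real file.

```python
def uncovered_characters(line, colspecs):
    """Return (index, char) pairs for non-space characters outside every interval."""
    covered = set()
    for start, end in colspecs:
        covered.update(range(start, end))   # intervals are half-open: [start, end)
    return [(i, ch) for i, ch in enumerate(line.rstrip('\r\n'))
            if not ch.isspace() and i not in covered]


colspecs = [(0, 43), (44, 56), (57, 69), (70, 79), (80, 84)]

# Build an illustrative line whose fields start exactly at each interval's start.
values = ["I love this phone", "isjiad123", "TYh23hs9", "13160032", "5"]
good_line = ""
for (start, _end), value in zip(colspecs, values):
    good_line = good_line.ljust(start) + value

# A made-up line whose review is one character too long for the first interval.
bad_line = "A" * 44 + good_line[44:]

print(uncovered_characters(good_line, colspecs))
print(uncovered_characters(bad_line, colspecs))
```

Running a check like this over a few lines of the real file gives a quick indication of whether the intervals need to be widened.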