use*_*223 10 python csv pandas
I have text files from Amazon that contain the following information:
# user item time rating review text (the header is added by me for explanation, not in the text file)
disjiad123 TYh23hs9 13160032 5 I love this phone as it is easy to use
hjf2329ccc TGjsk123 14423321 3 Suck restaurant
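As you can see, the data is separated by spaces, and each line has a different number of columns. The variation comes only from the review text. Here is the code I tried: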
pd.read_csv(filename, sep = " ", header = None, names = ["user","item","time","rating", "review"], usecols = ["user", "item", "rating"])#I'd like to skip the text review part
It raises this error:
ValueError: Passed header names mismatches usecols
When I try to read all the columns:
pd.read_csv(filename, sep = " ", header = None)
This time the error is:
Error tokenizing data. C error: Expected 229 fields in line 3, saw 320
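Because the review text is very long on many lines, the approach from this question of adding a header name for every column does not work.
I would like to know how to read the file in two cases: keeping the review text, and skipping it. Thank you in advance!
EDIT:
Martin Evans solved the problem perfectly. But now I am working with another dataset that has a similar but different format. The order of the data is now reversed:
# review text user item time rating (the header is added by me for explanation, not in the text file)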
I love this phone as it is easy to used isjiad123 TYh23hs9 13160032 5
Suck restaurant hjf2329ccc TGjsk123 14423321 3
Do you have any idea how to read it correctly? Any help would be appreciated!
Mar*_*ans 13
As suggested, DictReader can also be used as follows to create a list of rows. The list can then be imported as a frame in pandas:
import csv

import pandas as pd

rows = []
csv_header = ['user', 'item', 'time', 'rating', 'review']
frame_header = ['user', 'item', 'rating', 'review']

# Python 3: open in text mode with newline='' as the csv module recommends.
with open('input.csv', newline='') as f_input:
    # Any fields beyond the first four are collected as a list under restkey ('review').
    for row in csv.DictReader(f_input, delimiter=' ', fieldnames=csv_header[:-1], restkey=csv_header[-1], skipinitialspace=True):
        try:
            rows.append([row['user'], row['item'], row['rating'], ' '.join(row['review'])])
        except KeyError:
            # No review text on this line, so there is no 'review' key.
            rows.append([row['user'], row['item'], row['rating'], ' '])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)
This displays the following:
user item rating review
0 disjiad123 TYh23hs9 5 I love this phone as it is easy to use
1 hjf2329ccc TGjsk123 3 Suck restaurant
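As an alternative to DictReader, the same result can be sketched with plain `str.split`, capping the number of splits so that everything after the fourth field stays in one piece. This is not part of the answer above: the function name `read_reviews` and the in-memory sample are made up for illustration, and the sketch assumes the first four fields never contain spaces.

```python
import io

import pandas as pd

COLUMNS = ['user', 'item', 'time', 'rating', 'review']


def read_reviews(lines):
    """Build a DataFrame from lines shaped like 'user item time rating review...'."""
    rows = []
    for line in lines:
        # Split on whitespace at most 4 times; the remainder is the review text.
        parts = line.strip().split(None, 4)
        if not parts:
            continue  # skip blank lines
        # Pad with empty strings when trailing fields such as the review are missing.
        parts += [''] * (len(COLUMNS) - len(parts))
        rows.append(parts)
    return pd.DataFrame(rows, columns=COLUMNS)


# Hypothetical sample mirroring the data in the question.
sample = io.StringIO(
    "disjiad123 TYh23hs9 13160032 5 I love this phone as it is easy to use\n"
    "hjf2329ccc TGjsk123 14423321 3 Suck restaurant\n"
)
frame = read_reviews(sample)
print(frame[['user', 'item', 'rating']])  # select a subset to skip the review text
```

With a real file, an open file object can be passed instead of the sample, for example `with open(filename) as f: frame = read_reviews(f)`. Every value comes back as a string, so numeric columns would still need a conversion such as `frame['rating'].astype(int)`.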
If the review appears at the start of the line, one approach is to parse the line in reverse, as follows:
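import pandas as pd

rows = []
frame_header = ['rating', 'time', 'item', 'user', 'review']

# newline='' keeps the file's own line endings, so [2:] below removes '\r\n'.
# For files that end lines with '\n' only, use [1:] instead.
with open('input.csv', newline='') as f_input:
    for row in f_input:
        # Reverse the line, drop the line ending, split on spaces,
        # then re-reverse each non-empty entry.
        cols = [col[::-1] for col in row[::-1][2:].split(' ') if len(col)]
        # The first four entries are the fixed columns; the rest is the review.
        rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)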
This displays:
rating time item user \
0 5 13160032 TYh23hs9 isjiad123
1 3 14423321 TGjsk123 hjf2329ccc
review
0 I love this phone as it is easy to used
1 Suck restaurant
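row[::-1] reverses the text of the whole line, and [2:] skips the line ending, which is now at the start of the line (this assumes a two-character ending such as \r\n). The line is then split on spaces, and the list comprehension re-reverses each split entry. Finally, a row is appended to rows by taking the four fixed column entries (now at the start); the remaining entries are put back in their original order, joined with spaces, and added as the final column.
The benefit of this approach is that it does not rely on your input data being in an exactly fixed-width format, and you do not have to worry about whether the column widths change over time.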
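The same right-to-left idea can also be sketched with `str.rsplit`, which splits from the end of the string and so avoids reversing any characters. This is an alternative illustration rather than the answer's code: `split_review_first` is a hypothetical helper, and it assumes the last four fields contain no spaces.

```python
def split_review_first(line):
    """Split 'review... user item time rating' into its five fields."""
    # rsplit counts from the right: with at most 4 splits, the leftmost piece
    # keeps every space that belongs to the review text.
    parts = line.strip().rsplit(None, 4)
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(parts)}: {line!r}")
    review, user, item, time, rating = parts
    # Same column order as frame_header in the code above.
    return [rating, time, item, user, review]


# Hypothetical sample mirroring the edited data in the question.
sample = [
    "I love this phone as it is easy to used isjiad123 TYh23hs9 13160032 5\n",
    "Suck restaurant hjf2329ccc TGjsk123 14423321 3\n",
]
rows = [split_review_first(line) for line in sample]
for row in rows:
    print(row)
```

The resulting rows can be passed to `pd.DataFrame(rows, columns=frame_header)` exactly as in the code above. Because the split is on runs of whitespace, this does not depend on the line ending or on how many spaces separate the fields.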
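This looks like a fixed-width file. pandas provides read_fwf for exactly this purpose. The following code read the file correctly for me. If it does not work quite right, you may need to adjust the widths.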
pandas.read_fwf('test.fwf',
widths=[13, 12, 13, 5, 100],
names=['user', 'item', 'time', 'rating', 'review'])
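If the columns are still aligned in the edited version (where the review text comes first), you only need to supply the correct column specification. A guide line like the one below helps to work it out quickly: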
0        1         2         3         4         5         6         7         8
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
I love this phone as it is easy to used     isjiad123    TYh23hs9     13160032  5
Suck restaurant                             hjf2329ccc   TGjsk123     14423321  3
So the new command becomes:
pandas.read_fwf('test.fwf',
colspecs=[[0, 43], [44, 56], [57, 69], [70, 79], [80, 84]],
names=['review', 'user', 'item', 'time', 'rating'])
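To try these colspecs without a file on disk, the following sketch builds an aligned sample in memory and reads it back with `read_fwf`. The padding widths are an assumption chosen to match the colspecs above; the alignment of the real file may differ.

```python
import io

import pandas as pd

records = [
    ("I love this phone as it is easy to used", "isjiad123", "TYh23hs9", "13160032", "5"),
    ("Suck restaurant", "hjf2329ccc", "TGjsk123", "14423321", "3"),
]
# Pad each field so the columns start at offsets 0, 44, 57, 70 and 80 (assumed layout).
text = "".join(
    f"{review:<44}{user:<13}{item:<13}{time:<10}{rating:<5}\n"
    for review, user, item, time, rating in records
)

frame = pd.read_fwf(
    io.StringIO(text),
    colspecs=[(0, 43), (44, 56), (57, 69), (70, 79), (80, 84)],
    names=['review', 'user', 'item', 'time', 'rating'],
)
print(frame[['user', 'item', 'rating']])  # select a subset to skip the review text
```

Each colspec is a half-open interval, so `(0, 43)` covers character offsets 0 through 42. A review longer than that would be truncated, which is the main limitation of the fixed-width approach for free text.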