Asked by Car*_*to_. Tags: python, csv, comparison
I have a Python script that generates CSV files (data parsed from a website). Here is a sample of the CSV files:
File1.csv
China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
China;Beijing;BeiwaiOnline BFSU;;;
Italy;Curno;Bergamo, Anderson House;;Yes;
File2.csv
China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
This;Is;A;New;Line;;
Italy;Curno;Bergamo, Anderson House;;Yes;
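As you can see:
China;Beijing;BeiwaiOnline BFSU;;; ==> this line from File1.csv no longer exists in File2.csv.
This;Is;A;New;Line;; ==> this line in File2.csv is new (it does not exist in File1.csv).
I am looking for a way to compare these two CSV files. One important thing to know is that the order of the lines does not matter: a line can appear anywhere in the file.
What I want is a script that can tell me:
- one new line: This;Is;A;New;Line;;
- one removed line: China;Beijing;BeiwaiOnline BFSU;;;
and so on!
I tried this, without success: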
#!/usr/bin/python
# -*- coding: utf-8 -*-
import csv
f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
now = [row for row in c2]
past = [row for row in c1]
for row in now:
    #print row
    lol = past.index(row)
    print lol
f1.close()
f2.close()
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
What would be the best way to do this? Thanks a lot in advance ;)
EDIT:
import csv
f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
s1 = set(c1)
s2 = set(c2)
lol = s1 - s2
print type(lol)
print lol
That seemed like a good idea, but:
Traceback (most recent call last):
  File "compare.py", line 20, in <module>
    s1 = set(c1)
TypeError: unhashable type: 'list'
EDIT 2 (please disregard the above): *With your help, here is the script I wrote:*
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import csv
### COMPARISON THING ###
x=0
fichiers = os.listdir('/me/CSV')
for fichier in fichiers:
    if '.csv' in fichier:
        print('%s -----> %s' % (x,fichier))
        x=x+1
choice = raw_input("Which file do you want to compare with the new output ? ->>>")
past_file = fichiers[int(choice)]
print 'We gonna compare %s to our output' % past_file
s_now = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/now.csv', 'r'), delimiter=';')) ## OUR OUTPUT
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
added = [";".join(row) for row in s_now - s_past] # in "now" but not in "past"
removed = [";".join(row) for row in s_past - s_now] # in "past" but not in "now"
c = csv.writer(open("CHANGELOG.csv", "a"),delimiter=";" )
line = ['AD']
for item_added in added:
    line.append(item_added)
    c.writerow(['AD',item_added])
line = ['RM']
for item_removed in removed:
    line.append(item_removed)
    c.writerow(line)
I get two kinds of errors:
  File "programcompare.py", line 21, in <genexpr>
    s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: line contains NULL byte
or
  File "programcompare.py", line 21, in <genexpr>
    s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: newline inside string
It worked a few minutes ago, but then I changed the CSV files to test with different data, and here I am :-)
Sorry, last question!
If your data is not terribly large, loading the rows into sets (or frozensets) is a simple approach:
s_now = frozenset(tuple(row) for row in csv.reader(open('now.csv', 'r'), delimiter=';'))
s_past = frozenset(tuple(row) for row in csv.reader(open('past.csv', 'r'), delimiter=';'))
To get the list of entries that were added:
added = [";".join(row) for row in s_now - s_past] # in "now" but not in "past"
# Or, simply "added = list(s_now - s_past)" to keep them as tuples.
Likewise, the list of entries that were removed:
removed = [";".join(row) for row in s_past - s_now] # in "past" but not in "now"
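For reference, the pieces above can be combined into one small Python 3 function. This is only a sketch: the name diff_csv and the choice to return sorted lists of tuples are assumptions made for illustration, not part of the csv module.

```python
import csv


def diff_csv(now_path, past_path, delimiter=";"):
    """Return (added, removed) rows as sorted lists of tuples.

    diff_csv is a hypothetical helper name used only for this sketch.
    """
    def load(path):
        # newline="" lets the csv module handle line endings itself (Python 3).
        with open(path, newline="") as handle:
            reader = csv.reader(handle, delimiter=delimiter)
            # Rows come back as lists; tuples are hashable, so they fit in a set.
            return frozenset(tuple(row) for row in reader)

    now, past = load(now_path), load(past_path)
    added = sorted(now - past)    # in "now" but not in "past"
    removed = sorted(past - now)  # in "past" but not in "now"
    return added, removed
```

Because sets discard duplicates and ignore order, this reports which distinct rows appeared or disappeared, not how many times a row occurs.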
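To address your updated question about why you are seeing TypeError: unhashable type: 'list': the csv reader returns each entry as a list when you iterate over it. Lists are not hashable, so they cannot be inserted into a set.
To fix this, convert the list entries to tuples before adding them to the set. See the first part of my answer for an example of how this is done.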
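A tiny illustration of the difference, using made-up rows (the variable names are only for this sketch):

```python
rows = [["China", "Beijing"], ["Italy", "Bari"]]

# Building a set directly from lists fails, because lists are mutable
# and therefore not hashable.
try:
    set(rows)
    list_error = None
except TypeError as exc:
    list_error = str(exc)

# Converting each row to a tuple first works.
row_set = set(tuple(row) for row in rows)
```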
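As for the other errors you are seeing, they both come down to the contents of your CSV file.
_csv.Error: newline inside string
It looks like you have a quote character (") somewhere in your data, which confuses the parser. I am not familiar enough with the csv module to tell you exactly what is going wrong, and I cannot see your data anyway.
I did manage to reproduce the error, though: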
>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";")]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: newline inside string
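In this case, it can be fixed by telling the reader not to do any special processing of quotes (see csv.QUOTE_NONE). Note that this disables the handling of quoted data, which normally allows a delimiter to appear inside a quoted string without splitting that string into separate entries.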
>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";", quoting=csv.QUOTE_NONE)]
[['hello', 'wo', '"rld']]
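To make that trade-off concrete, here is a small Python 3 sketch with an invented input line that contains a properly quoted field with a delimiter inside it:

```python
import csv

line = ['a;"b;c";d']

# Default quoting: the quoted field keeps its embedded delimiter.
default_rows = list(csv.reader(line, delimiter=";"))

# QUOTE_NONE: quote characters are treated as ordinary data,
# so the field is split at every delimiter.
raw_rows = list(csv.reader(line, delimiter=";", quoting=csv.QUOTE_NONE))

print(default_rows)  # [['a', 'b;c', 'd']]
print(raw_rows)      # [['a', '"b', 'c"', 'd']]
```

So QUOTE_NONE is only a good fit if your data never relies on quoting to protect a semicolon.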
_csv.Error: line contains NULL byte
I am guessing this may come down to the encoding of your CSV file.
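If that guess is right, one possible workaround in Python 3 is to check for NUL bytes and open the file with a matching encoding. The sketch below assumes, without any evidence from your files, that NUL bytes mean the file was saved as UTF-16 with a byte order mark; read_rows is a hypothetical helper name.

```python
import csv


def read_rows(path, delimiter=";"):
    """Read rows as tuples, guessing UTF-16 when the file contains NUL bytes.

    Assumption for this sketch: NUL bytes indicate a UTF-16 file with a BOM.
    """
    # Reads the whole file once, which is fine for small files.
    with open(path, "rb") as handle:
        has_nul = b"\x00" in handle.read()
    encoding = "utf-16" if has_nul else "utf-8"
    with open(path, newline="", encoding=encoding) as handle:
        return [tuple(row) for row in csv.reader(handle, delimiter=delimiter)]
```

If the NUL bytes turn out to be stray garbage rather than an encoding issue, this will not help and the file itself needs cleaning.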