I want to read some very large files (to be exact: the Google ngram 1-gram dataset) and count how often each character occurs. I have written this script:
import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0,9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files):
    line = line.strip()
    data = line.split('\t')
    for character in list(data[0]):
        if (not character in charcounts):
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if (fileinput.filename() is not lastfile):
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if(fileinput.filelineno() % 100000 == 0):
        print(fileinput.filelineno())
print(charcounts)
This works fine until it reaches approximately line 700,000 of the first file, at which point I get this error:
../../datasets/googlebooks-eng-all-1gram-20090715-0.csv
100000
200000
300000
400000
500000
600000
700000
Traceback (most recent call last):
  File "charactercounter.py", line 5, in <module>
    for line in fileinput.input(files):
  File "C:\Python31\lib\fileinput.py", line 254, in __next__
    line = self.readline()
  File "C:\Python31\lib\fileinput.py", line 349, in readline
    self._buffer = self._file.readlines(self._bufsize)
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7771: character maps to <undefined>
To solve this, I searched the web and came up with this code:
import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0,9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files,False,'',0,'r',fileinput.hook_encoded('utf-8')):
    line = line.strip()
    data = line.split('\t')
    for character in list(data[0]):
        if (not character in charcounts):
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if (fileinput.filename() is not lastfile):
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if(fileinput.filelineno() % 100000 == 0):
        print(fileinput.filelineno())
print(charcounts)
But the hook I am now using tries to read the entire 990 MB file into memory at once, which crashes my computer. Does anyone know how to rewrite this code so that it actually works?

ps: The code has not yet run all the way through, so I don't even know whether it does what it is supposed to do, but to find that out I first need to fix this bug.

Oh, and I am using Python 3.2.
I do not know why fileinput is not working as expected here.

I suggest you use the open function instead. Its return value can be iterated over and yields lines, just like fileinput does.

The code would be something like this:
for filename in files:
    print(filename)
    for filelineno, line in enumerate(open(filename, encoding="utf-8")):
        line = line.strip()
        data = line.split('\t')
        # ...
Some documentation links: enumerate, open, io.TextIOWrapper (open returns an instance of TextIOWrapper).
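Putting this suggestion together with the counting logic from the question, a complete version might look like the sketch below. The function name `count_characters` is my own; `collections.Counter` is used as a convenience instead of the manual dict bookkeeping, and the sketch follows the question's assumption that the count is in the second tab-separated column of each line:

```python
import collections

def count_characters(filenames, encoding='utf-8'):
    """Sum character occurrences across files, weighted by the count column."""
    charcounts = collections.Counter()
    for filename in filenames:
        # open() with an explicit encoding iterates lazily, line by line,
        # so the whole file is never held in memory at once
        with open(filename, encoding=encoding) as f:
            for line in f:
                data = line.strip().split('\t')
                ngram, count = data[0], int(data[1])
                for character in ngram:
                    charcounts[character] += count
    return charcounts
```

It would be called with the paths from the question, e.g. `count_characters(['../../datasets/googlebooks-eng-all-1gram-20090715-0.csv'])`. The progress printing from the original script is omitted here for brevity but could be restored with `enumerate`.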