Reading a very large file line by line in Python with utf-8 encoding

teu*_*oon 9 python file-io dataset

I want to read some very large files (to be precise: the Google ngram 1-gram dataset) and count how many times each character occurs. I have written this script:

import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0,9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files):
    line = line.strip()
    data = line.split('\t')
    for character in data[0]:
        if character not in charcounts:
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if fileinput.filename() != lastfile:  # '!=', not 'is not': compare string values, not identities
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if fileinput.filelineno() % 100000 == 0:
        print(fileinput.filelineno())
print(charcounts)

This works fine until it reaches roughly line 700,000 of the first file, and then I get this error:

../../datasets/googlebooks-eng-all-1gram-20090715-0.csv
100000
200000
300000
400000
500000
600000
700000
Traceback (most recent call last):
  File "charactercounter.py", line 5, in <module>
    for line in fileinput.input(files):
  File "C:\Python31\lib\fileinput.py", line 254, in __next__
    line = self.readline()
  File "C:\Python31\lib\fileinput.py", line 349, in readline
    self._buffer = self._file.readlines(self._bufsize)
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7771: character maps to <undefined>
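For context: the `cp1252.py` frame in the traceback appears because on Windows, `open()` (which `fileinput` uses internally) defaults to the locale's preferred encoding, typically cp1252, rather than UTF-8. Byte `0x8d` simply has no character assigned in cp1252, which reproduces the failure in isolation (a minimal demonstration, independent of the dataset):

```python
# 0x8d is one of the few bytes cp1252 leaves undefined, so decoding it
# raises the same "character maps to <undefined>" error as the traceback
try:
    b'\x8d'.decode('cp1252')
except UnicodeDecodeError as exc:
    print(exc)
```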

To solve this I searched the web and came up with this code:

import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0,9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files, openhook=fileinput.hook_encoded('utf-8')):
    line = line.strip()
    data = line.split('\t')
    for character in data[0]:
        if character not in charcounts:
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if fileinput.filename() != lastfile:  # '!=', not 'is not': compare string values, not identities
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if fileinput.filelineno() % 100000 == 0:
        print(fileinput.filelineno())
print(charcounts)

But the hook I am now using tries to read the entire 990 MB file into memory at once, which crashes my PC. Does anyone know how to rewrite this code so that it actually works?

ps: the code hasn't even run all the way through yet, so I don't know whether it does what it is supposed to do, but to find that out I first need to fix this bug.

Oh, and I'm using Python 3.2.

cod*_*ape 7

I don't know why fileinput doesn't work as expected.

I suggest you use the open function instead. Its return value can be iterated over and yields lines, just like fileinput.

The code would look something like this:

for filename in files:
    print(filename)
    # 'with' closes the file deterministically; enumerate(..., 1) makes
    # filelineno 1-based, matching fileinput.filelineno()
    with open(filename, encoding="utf-8") as f:
        for filelineno, line in enumerate(f, 1):
            line = line.strip()
            data = line.split('\t')
            # ...

Some documentation links: enumerate, open, io.TextIOWrapper (open returns an instance of TextIOWrapper).
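Putting this answer together with the counting logic from the question, a complete sketch might look like the following. It keeps the question's assumption that the second tab-separated column is the count to add, and swaps the manual dictionary bookkeeping for collections.Counter; since open() decodes lazily, line by line, the whole file is never held in memory at once:

```python
import collections

def count_chars(filenames):
    """Count character occurrences, weighted by each line's count column."""
    charcounts = collections.Counter()
    for filename in filenames:
        print(filename)
        # Text-mode open() with an explicit encoding decodes incrementally,
        # so even a 990 MB file is processed one line at a time.
        with open(filename, encoding="utf-8") as f:
            for filelineno, line in enumerate(f, 1):
                data = line.rstrip("\n").split("\t")
                weight = int(data[1])  # assumes column 2 holds the count
                for character in data[0]:
                    charcounts[character] += weight
                if filelineno % 100000 == 0:
                    print(filelineno)
    return charcounts
```

Counter behaves like the question's defaulting dict (missing keys count as 0), so the `if character not in charcounts` guard disappears.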