Stripping \xef\xbb\xbf from a list read from a file

Question

Bak*_*ina 3 python stop-words python-2.7

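I am trying to read a large data file, file.txt, and strip out all commas, periods, and similar punctuation, so I read the file in Python with the following code: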

import re
from nltk.corpus import stopwords

file = open("file.txt", "r")
importantWords = []
for i in file.readlines():
    line = i[:-1].split(" ")
    for word in line:
        for j in word:
            word = re.sub('[\!@#$%^&*-/,.;:]','',word)
            word.lower()
        if word not in stopwords.words('spanish'):
            importantWords.append(word)
print importantWords

It then prints ['\xef\xbb\xbfdataText1', 'dataText2' .. 'dataTextn'].

How can I get rid of the \xef\xbb\xbf? I am using Python 2.7.

Answer 1

fal*_*tru 5

It is the UTF-8 encoded BOM (byte order mark).

>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
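
If the bytes have already been read into memory, the BOM can also be removed by hand. The following is a minimal sketch, not part of the original answer; the helper name strip_utf8_bom is hypothetical, and it assumes the data is a byte string (str in Python 2.7, bytes in Python 3):

```python
import codecs

def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if it has one.

    strip_utf8_bom is a hypothetical helper name used for illustration.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

cleaned = strip_utf8_bom(b"\xef\xbb\xbfdataText1")
untouched = strip_utf8_bom(b"dataText2")
```

Only a BOM at the very start of the data is removed, so this would be applied once to the first chunk or first line, not to every word.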

You can use codecs.open with encoding='utf-8-sig' to skip the BOM sequence:

with codecs.open("file.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        ...
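
To see the difference between the two codecs, here is a small self-contained sketch. It writes a hypothetical sample file into a temporary directory and uses io.open in place of codecs.open (io.open takes the same encoding argument and is available in Python 2.7 and Python 3); the file contents are made up for illustration:

```python
import codecs
import io
import os
import tempfile

# Hypothetical sample file: a UTF-8 BOM followed by one line of text.
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"dataText1 dataText2\n")

# Plain "utf-8" decodes the BOM into the character U+FEFF and keeps it.
with io.open(path, "r", encoding="utf-8") as f:
    with_bom = f.read().split()

# "utf-8-sig" drops a leading BOM if one is present.
with io.open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read().split()
```

Here with_bom[0] still starts with u"\ufeff", while without_bom holds only the two words. Both calls return unicode strings, unlike the byte strings produced by a plain open in Python 2.7.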

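Side note: there is no need to use file.readlines; just iterate over the file. If you only need to loop over the lines, file.readlines creates an unnecessary temporary list.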
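
Putting both points together, the loop from the question might be reworked roughly as follows. This is a sketch under stated assumptions, not the asker's or answerer's code: STOPWORDS is a hypothetical stand-in for stopwords.words('spanish') so the example runs without NLTK, the sample file contents are made up, and the hyphen in the character class is moved to the end so that it is matched literally and not read as a range.

```python
import io
import os
import re
import tempfile

# Hypothetical stand-in for stopwords.words('spanish') so no NLTK is needed.
STOPWORDS = {u"de", u"la", u"y"}

def important_words(path):
    """Collect lowercased, punctuation-free words that are not stop words."""
    words = []
    # "utf-8-sig" skips a leading BOM; iterating over f reads line by line
    # without building the temporary list that readlines() would create.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        for line in f:
            for word in line.split():
                word = re.sub(u"[!@#$%^&*/,.;:-]", u"", word).lower()
                if word and word not in STOPWORDS:
                    words.append(word)
    return words

# Hypothetical sample file starting with a UTF-8 BOM.
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfLa casa, y el perro.\nde Madrid\n")

result = important_words(path)
```

Building the stop-word collection once as a set, outside the loop, also avoids recomputing it for every word.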
