Python: read from a file and remove non-ASCII characters

Question

use*_*963 3 python encoding utf character-encoding

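I have the following program that reads a file word by word and writes each word to another file, but without the non-ASCII characters from the first file.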

import unicodedata
import codecs
infile = codecs.open('d.txt','r',encoding='utf-8',errors='ignore')
outfile = codecs.open('d_parsed.txt','w',encoding='utf-8',errors='ignore')


for line in infile.readlines():
    for word in line.split():
        outfile.write(word+" ")
    outfile.write("\n")

infile.close()
outfile.close()

The only problem I am facing is that this code does not write any newlines to the second file (d_parsed). Any clues?

Answer 1

jfs*_*jfs 7

codecs.open() does not support universal newlines; for example, it does not translate \r\n to \n when reading on Windows.

Use io.open() instead:

#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('d.txt','r',encoding='utf-8',errors='ignore') as infile, \
     io.open('d_parsed.txt','w',encoding='ascii',errors='ignore') as outfile:
    for line in infile:
        print(*line.split(), file=outfile)

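To make the newline difference concrete, here is a minimal sketch that reads the same file both ways. The temporary file and its contents are made-up sample data, not part of the original question.

```python
import codecs
import io
import os
import tempfile

# Write a small file with Windows-style line endings (hypothetical sample data).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"one two\r\nthree\r\n")

try:
    # codecs.open() opens the underlying file in binary mode,
    # so "\r\n" reaches the caller unchanged.
    with codecs.open(path, "r", encoding="utf-8") as f:
        via_codecs = f.read()

    # io.open() in text mode (newline=None by default) translates "\r\n" to "\n".
    with io.open(path, "r", encoding="utf-8") as f:
        via_io = f.read()
finally:
    os.remove(path)

print(repr(via_codecs))
print(repr(via_io))
```

The first result still contains the carriage returns, while the second contains only "\n".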

By the way, if you want to remove non-ASCII characters, you should use ascii instead of utf-8 as the output file's encoding.
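A small in-memory sketch of why the output encoding matters: with errors="ignore", the ascii codec silently drops every character it cannot represent, while utf-8 can represent everything and therefore drops nothing. The sample string is an arbitrary illustration.

```python
# Sample text with accented Latin letters and CJK characters (illustrative only).
text = u"caf\u00e9 na\u00efve \u4f60\u597d ok"

# Encoding to ASCII with errors="ignore" discards every non-ASCII character.
ascii_only = text.encode("ascii", errors="ignore").decode("ascii")

# Encoding to UTF-8 keeps all characters, so nothing is removed.
roundtrip = text.encode("utf-8", errors="ignore").decode("utf-8")

print(repr(ascii_only))
print(roundtrip == text)
```

Note that the spaces around the removed characters remain, so runs of whitespace can appear where a non-ASCII word used to be.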

If the input encoding is ASCII-compatible (such as utf-8), you can open the files in binary mode and use bytes.translate() to remove the non-ASCII characters:

#!/usr/bin/env python
nonascii = bytearray(range(0x80, 0x100))
with open('d.txt','rb') as infile, open('d_parsed.txt','wb') as outfile:
    for line in infile: # b'\n'-separated lines (Linux, OSX, Windows)
        outfile.write(line.translate(None, nonascii))

Unlike the first code example, it does not normalize whitespace.
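The whitespace difference between the two approaches can be seen on a single line of bytes. This is only a sketch: the helper names strip_bytes and strip_and_normalize are hypothetical, and the second helper imitates the split-and-rejoin behaviour of the text-mode example without writing to a file.

```python
# All byte values from 0x80 to 0xFF, i.e. everything outside ASCII.
NONASCII = bytes(bytearray(range(0x80, 0x100)))

def strip_bytes(line):
    """Delete bytes >= 0x80; whitespace is left untouched."""
    return line.translate(None, NONASCII)

def strip_and_normalize(line):
    """Decode, drop non-ASCII characters, and collapse runs of whitespace."""
    text = line.decode("utf-8", errors="ignore")
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    return " ".join(ascii_text.split())

# Illustrative input with leading, repeated, and trailing whitespace.
sample = u"  caf\u00e9   au  lait \n".encode("utf-8")

print(repr(strip_bytes(sample)))
print(repr(strip_and_normalize(sample)))
```

The byte-level version keeps the original spacing and the trailing newline, while the normalizing version returns the words joined by single spaces.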
