How do I fix "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>"?

use*_*027 10 python sqlite unicode file-io decode

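I am currently trying to process a text file full of information with a Python 3 program in the Spyder IDE/GUI. However, when I try to read the file, I get the following error: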

  File "<ipython-input-13-d81e1333b8cd>", line 77, in <module>
    parser(f)

  File "<ipython-input-13-d81e1333b8cd>", line 18, in parser
    data = infile.read()

  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>

The code of the program is as follows:

import os

os.getcwd()

import glob
import re
import sqlite3
import csv

def parser(file):

    # Open a TXT file. Store all articles in a list. Each article is an item
    # of the list. Split articles based on the location of such string as
    # 'Document PRN0000020080617e46h00461'

    articles = []
    with open(file, 'r') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n   ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes. Extract
    # content of each used field code, and write to a CSV file.

    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',
              'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']

    for a in articles:
        used = [f for f in fields if re.search(r'\n   ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\n   ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n   ' + f + r'\n', a)
            f_pos = [f, f_m.start(), f_m.end()]
            fields_pos.append(f_pos)
        obs = []
        n = len(used)
        for i in range(0, n):
            used_f = fields_pos[i][0]
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        for f in unused:
            obs.insert(f[0], '')
        obs.insert(0, file.split('/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY
        # print(obs)
        cur.execute('''INSERT INTO articles
                       (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,
                       co, ina, ns, re, ipc, ipd, pub, an)
                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', obs)

# Write to SQLITE
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes, except that 'IN' is stored as 'ina' because IN is a reserved SQL keyword
    cur.execute('''CREATE TABLE articles
                   (nid integer primary key, id text, hd text, cr text, wc text, pd text,
                   et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,
                   td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,
                   ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)

# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)

Can anyone help me? Thanks in advance!!

小智 49

Add the encoding to the open statement, for example:

f=open("filename.txt","r",encoding='utf-8')
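
If it is unclear which encoding the file actually uses, one option is to try a short list of candidates in order. The sketch below is only an illustration: the helper name `read_text_with_fallback`, the candidate list, and the sample files are assumptions, and a decode that succeeds does not prove the guessed encoding is the right one.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes without error.

    Hypothetical helper, not part of the original program.
    """
    with open(path, "rb") as infile:
        raw = infile.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path!r}")


# Two made-up sample files: one UTF-8, one in a legacy single-byte encoding.
tmp_dir = tempfile.mkdtemp()
utf8_path = os.path.join(tmp_dir, "utf8.txt")
legacy_path = os.path.join(tmp_dir, "legacy.txt")
with open(utf8_path, "wb") as f:
    f.write("\u201cquoted\u201d".encode("utf-8"))
with open(legacy_path, "wb") as f:
    f.write(b"caf\xe9")

print(read_text_with_fallback(utf8_path)[1])    # utf-8
print(read_text_with_fallback(legacy_path)[1])  # cp1252
```

UTF-8 is tried first because it rejects most byte sequences that were written in another encoding, whereas single-byte code pages accept almost anything.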


Gia*_*zzi 26

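As you can see at https://en.wikipedia.org/wiki/Windows-1252, the code 0x9D is not defined in CP1252.

The "error" is in your open call: you did not specify an encoding, so Python falls back to a system encoding (on Windows this is typically a legacy code page such as cp1252, as your traceback shows). In general, if the file you are reading may not have been created on the same machine, it is better to specify the encoding.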

I recommend that you also specify an encoding in the open call that writes the CSV. It really is better to be explicit.
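
A minimal sketch of writing the CSV with an explicit encoding. The rows and the temporary output path are made up for illustration; in the question's program the rows would come from the SELECT query.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the query results in the question.
rows = [("nid", "hd"), (1, "Caf\u00e9 \u201cquoted\u201d headline")]

out_path = os.path.join(tempfile.mkdtemp(), "factiva.csv")

# encoding= is stated explicitly, so the output does not depend on the
# machine's locale settings.
with open(out_path, "w", newline="", encoding="utf-8") as csvfile:
    csv.writer(csvfile).writerows(rows)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, "r", newline="", encoding="utf-8") as csvfile:
    read_back = list(csv.reader(csvfile))

print(len(read_back))  # 2
```

Whatever reads the CSV afterwards then has to be told the same encoding.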

I don't know the format of the original file, but adding , encoding='utf-8' to open is usually a good thing (it is usually the default on Linux and macOS).
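
To see which encoding a given machine uses when open is called without an encoding argument, something like the following can be run. It is a small sketch that assumes a Python 3 version in which text-mode open falls back to the locale's preferred encoding; the variable name is arbitrary.

```python
import locale

# The locale's preferred encoding, which text-mode open() uses when no
# encoding= argument is given (assumption: default settings, Python 3).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

On the machine from the question this would be expected to report a cp1252 variant, since the traceback goes through encodings\cp1252.py.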

  • [What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers) (2 upvotes)

Rom*_*ano 16

The above did not work for me. Try this instead: add , errors='ignore' to the open call. It worked wonders!

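  • Using both encoding='utf-8' and errors='ignore' would make more sense (6 upvotes)
  • Hiding errors is usually the wrong approach. It only makes sense in exceptional cases, but more often it is used in desperation by people who do not understand encodings. Now is the time to finally read [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) (2 upvotes)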
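
To make the trade-off in these comments concrete, here is a small sketch of what the different error handlers do to the same bytes. The sample text is made up; it contains a right double quotation mark (U+201D), whose UTF-8 encoding E2 80 9D ends in the byte 0x9D from the question's error message.

```python
# UTF-8 bytes of a made-up sentence with curly quotes.
raw = b"He said \xe2\x80\x9chello\xe2\x80\x9d"

# Wrong codec + errors='ignore': no exception, but the undecodable byte 0x9D
# is silently dropped and the other bytes are decoded as the wrong characters.
ignored = raw.decode("cp1252", errors="ignore")

# Wrong codec + errors='replace': the bad byte becomes U+FFFD, so the damage
# is at least visible in the result.
replaced = raw.decode("cp1252", errors="replace")

# Right codec: nothing is lost and no error handler is needed.
correct = raw.decode("utf-8")

print(ascii(ignored))
print(ascii(replaced))
print(ascii(correct))
```

With the wrong codec, errors='ignore' only removes the exception, not the underlying mismatch, which is why the comments suggest fixing the encoding first.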

小智 8

If you don't need to decode the file, you can also open it in binary mode with file = open(filename, 'rb'), for example if you just want to upload it to a website.
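
Binary mode can also help with diagnosing the problem, because the raw bytes around the failing position can be inspected without decoding anything. The sketch below uses a made-up sample file and a hypothetical helper `bytes_around`; it assumes the position in the error message is a byte offset into the file, which would be 29815 in the question.

```python
import os
import tempfile


def bytes_around(path, position, context=8):
    """Return the raw bytes surrounding `position`, without decoding them."""
    with open(path, "rb") as infile:
        raw = infile.read()
    return raw[max(0, position - context):position + context + 1]


# Made-up sample file whose byte at offset 18 is 0x9D.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "wb") as f:
    f.write(b"He said \xe2\x80\x9chello\xe2\x80\x9d to everyone")

window = bytes_around(sample_path, 18, context=4)
print(window)  # b'lo\xe2\x80\x9d to '
```

Seeing 0x9D preceded by E2 80 is a strong hint that the file is UTF-8 and the byte belongs to a curly quote.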


小智 6

errors='ignore' solved my headache:

How to find the word "coma" in a directory and its subdirectories:

import os
rootdir=('K:\\0\\000.THU.EEG.nedc_tuh_eeg\\000edf.01_tcp_ar\\01_tcp_ar\\')
for folder, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith('.txt'):
            fullpath = os.path.join(folder, file)
            with open(fullpath, 'r', errors='ignore') as f:
                for line in f:
                    if "coma" in line:
                        print(fullpath)
                        break