Python: open and read a file containing German umlauts as unicode

Question


Ami*_*min 1 python sqlite unicode utf-8 diacritics

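I have written a program that reads words from a text file and inserts them into a sqlite database, treating them as strings. But I need to insert some words that contain German umlauts and ß: äöüß.

Here is a prepared piece of code:

I tried both # -*- coding: iso-8859-15 -*- and # -*- coding: utf-8 -*-, and it makes no difference (!)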

    # -*- coding: iso-8859-15 -*-
    import sqlite3

    dbname = 'sampledb.db'
    filename ='text.txt'


    con = sqlite3.connect(dbname)
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,name)''')    

    #f=open(filename)
    #text = f.readlines()
    #f.close()

    text = u'süß'

    print (text)
    cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,))       

    con.commit()

    sentence = "The name is: %s" %(text,)

    print (sentence)
    #f.close()  # f only exists when the three file-reading lines above are enabled
    con.close()


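The code above runs fine. But I need to read "text" from a file that contains the word "süß". So when I uncomment the three lines ( f=open(filename) .... ) and comment out text = u'süß', I get this error: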

    sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.


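I tried reading the file with the codecs module as utf-8 and as iso-8859-15. But I could not decode the result into the string "süß", which I need to complete my sentence at the end of the code.

Once I tried decoding to utf-8 before inserting into the database. That worked, but I could not use the result as a string.

Is there a way to import süß from a file and use it both for inserting into sqlite and as a string?

More details:

Here are some more details for clarification. I had used codecs.open before. The text file containing the word süß is saved as utf-8. Using f=codecs.open(filename, 'r', 'utf-8') and text=f.read(), I read the file as unicode u'\ufeffs\xfc\xdf'. Inserting this unicode into sqlite3 goes smoothly: cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,)).

The problem is here: sentence = "The name is: %s" %(text,) gives u'The name is: \ufeffs\xfc\xdf', and I also need print(text) to output süß, whereas print(text) raises this error: UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>.

Thank you.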

Answer 1

Mar*_*som 5

When you open and read a file this way, you get 8-bit strings rather than Unicode. To get Unicode strings in Python 2, open the file with codecs.open instead:

    f=codecs.open(filename, 'r', 'utf-8')


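Hopefully you have moved to Python 3, where the encoding is passed to the regular open call. Also, unless you open with the 'b' binary flag, you always get Unicode strings rather than 8-bit byte strings; if you don't specify an encoding, a default encoding is used.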

    f=open(filename, 'r', encoding='utf-8')


Of course, depending on how the file was written, you may need to use 'iso-8859-15' instead.
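As a rough illustration of the Python 3 route, the sketch below reads a UTF-8 file and stores its words with sqlite3. The temporary file and the in-memory database are assumptions made so the example is self-contained; the table layout follows the question.

```python
import os
import sqlite3
import tempfile

# Hypothetical sample file, written here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "text.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("süß\n")

# Text mode plus an explicit encoding yields str (Unicode) objects.
with open(path, "r", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

con = sqlite3.connect(":memory:")  # in-memory database, for illustration only
cur = con.cursor()
cur.execute("create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY, name)")

# One row per word; each bound parameter is a single str, not a list.
cur.executemany("insert into table1 (id, name) VALUES (NULL, ?)",
                [(word,) for word in words])
con.commit()

# The value read back from the database is an ordinary str again.
stored = cur.execute("select name from table1").fetchone()[0]
sentence = "The name is: %s" % (stored,)
con.close()
```

The same string serves both purposes here: it is bound as a query parameter and formatted into sentence without any extra decoding step.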

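Edit: One big difference between your test code and the commented-out code is that reading from the file with readlines produces a list, while the test uses a single string. Maybe your problem has nothing to do with Unicode at all. Try making this substitution in your test code and see whether it produces the same error: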

    text = [u'süß']


Unfortunately, I don't have enough experience with SQL in Python to help you any further.
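The list-versus-string hypothesis can be checked with a small sketch like the one below. It assumes Python 3 and an in-memory database; the exception subclass and message for an unsupported parameter type differ between Python versions, so the sketch catches the common base class sqlite3.Error.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory database, for illustration only
cur = con.cursor()
cur.execute("create table table1 (id INTEGER PRIMARY KEY, name)")

lines = ["süß\n"]  # the shape that f.readlines() returns: a list of lines

try:
    # Binding the whole list as one parameter is not supported by default.
    cur.execute("insert into table1 (id, name) VALUES (NULL, ?)", (lines,))
    list_rejected = False
except sqlite3.Error:
    # The exact subclass and message vary between Python versions.
    list_rejected = True

# Binding each line as its own string works.
for line in lines:
    cur.execute("insert into table1 (id, name) VALUES (NULL, ?)",
                (line.strip(),))
con.commit()

rows = cur.execute("select name from table1").fetchall()
con.close()
```

If list_rejected ends up True, the binding error comes from the parameter being a list, independent of any encoding question.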

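Also, when you print a list rather than a single string, Unicode characters are replaced with their equivalent escape sequences. To see what the strings really contain, print them one at a time. If you are curious, this is the difference between __str__ and __repr__.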
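A small sketch of that difference, assuming Python 3: there, repr keeps printable non-ASCII characters such as ü and ß, so the built-in ascii() is used to reproduce the escaped form that Python 2 shows for unicode objects inside a list. Non-printable characters such as the BOM are still escaped by repr.

```python
words = ["süß"]

# A list formats its items with repr(), so the quotes are part of the result.
as_list = str(words)

# A single string converted with str() is shown as-is.
as_item = str(words[0])

# ascii() escapes every non-ASCII character, similar to what Python 2
# displays for unicode objects inside a list.
escaped = ascii(words[0])

# repr() in Python 3 still escapes non-printable characters such as U+FEFF,
# which makes a stray byte order mark easy to spot.
with_bom = repr("\ufeffsüß")
```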

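Edit 2: The character u'\ufeff' is called a byte order mark, or BOM, and is inserted by some editors to indicate that the file really is UTF-8. You should get rid of it before using the string. There should only be one, at the beginning of the file. See e.g. "Reading Unicode file data with BOM chars in Python".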
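One way to drop the BOM while reading, sketched here under the assumption that the file is UTF-8 with an optional BOM, is Python's 'utf-8-sig' codec, which skips a leading BOM if one is present. The temporary file is only there to make the example self-contained.

```python
import codecs
import os
import tempfile

# Hypothetical sample file that starts with a UTF-8 byte order mark.
path = os.path.join(tempfile.mkdtemp(), "text.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "süß".encode("utf-8"))

# Plain 'utf-8' keeps the BOM as the first character of the text.
with open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# 'utf-8-sig' decodes the same bytes but skips a leading BOM.
with open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()
```

The codec name is also accepted by codecs.open, so the same idea should carry over to the Python 2 call shown earlier.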

Answer 2

Ami*_*min 5

I was able to solve the problem. Thanks for your help.

Here it is:

    # -*- coding: iso-8859-1 -*-

    import sys
    import codecs
    import sqlite3

    f = codecs.open("suess_sweet.txt", "r", "utf-8")    # suess_sweet.txt file contains two
    text_in_unicode = f.read()                          # comma-separated words: süß, sweet
    f.close()

    stdout_encoding = sys.stdout.encoding or sys.getfilesystemencoding()

    con = sqlite3.connect('dict1.db')
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,German,English)''')

    [ger,eng] = text_in_unicode.split(',')

    cur.execute('''insert into table1 (id,German,English) VALUES (NULL,?,?)''',(ger,eng))

    con.commit()

    sentence = "The German word is: %s" %(ger,)

    print sentence.encode(stdout_encoding)

    con.close()


I got some help from this page (in German).

The output is:

    The German word is: ?süß


There is still one small problem: the "?". I thought the unicode prefix u' was being replaced by ? after encoding. sentence gives:

    >>> sentence
    u'The German word is: \ufeffs\xfc\xdf '


Encoding the sentence gives:

    >>> sentence.encode(stdout_encoding)
    'The German word is: ?s\xfc\xdf '


So it is not what I thought.

A simple solution I came up with to get rid of the question mark is to use the replace function:

    sentence = "The German word is: %s" %(ger,)
    to_print = sentence.encode(stdout_encoding)
    to_print = to_print.replace('?','')

    >>> print(to_print)
    The German word is: süß

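Removing every '?' also deletes question marks that genuinely belong to the text. A possible alternative, sketched here for Python 3, is to remove the BOM character itself before encoding. The 'cp1252' codec is an assumption standing in for a Windows console encoding, and errors="replace" is used only to show where a "?" can come from.

```python
# The word as read from a UTF-8 file that starts with a BOM.
ger = "\ufeffsüß"
sentence = "The German word is: %s" % (ger,)

# cp1252 has no mapping for U+FEFF; with errors="replace" it becomes "?".
encoded = sentence.encode("cp1252", errors="replace")

# Removing the BOM character first leaves nothing that needs replacing,
# and real question marks in the text stay untouched.
cleaned = sentence.replace("\ufeff", "")
encoded_clean = cleaned.encode("cp1252")
```

Reading the file with a BOM-aware codec would avoid the cleanup step altogether.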

Thank you, SO :)

Archived:	11 years, 11 months ago
Viewed:	16851 times
Last activity:	4 years, 9 months ago