使用具有重音和不同字符的 Beautiful Soup

Question

使用具有重音和不同字符的 Beautiful Soup

Bri*_*son 5 python unicode beautifulsoup python-2.7

我正在使用 Beautiful Soup 来提取过去奥运会的奖牌获得者。一些赛事和运动员姓名中使用重音的问题令人困惑。我在网上看到过类似的问题，但我是 Python 新手，无法将它们应用到我的代码中。

\n\n

如果我打印我的汤，重音就会显得很好。但是当我开始解析汤（并将其写入 CSV 文件）时，重音字符会变成乱码。\n\'Louis Perr\xc3\xa9e\' 变为 \'Louis Perr\xe2\x88\x9a\xc2\xa9e\'

\n\n

from BeautifulSoup import BeautifulSoup\nimport urllib2\n\nresponse = urllib2.urlopen(\'http://www.databaseolympics.com/sport/sportevent.htm?sp=FEN&enum=130\')\nhtml = response.read()\nsoup = BeautifulSoup(html)\n\ng = open(\'fencing_medalists.csv\',\'w"\')\nt = soup.findAll("table", {\'class\' : \'pt8\'})\n\nfor table in t:\n    rows = table.findAll(\'tr\')\n    for tr in rows:\n        cols = tr.findAll(\'td\')\n        for td in cols:\n            theText=str(td.find(text=True))\n            #theText=str(td.find(text=True)).encode("utf-8")\n            if theText!="None":\n                g.write(theText)\n            else: \n                g.write("")\n            g.write(",")\n        g.write("\\n")\n

Run Code Online (Sandbox Code Playgroud)\n\n

非常感谢您的帮助。

\n

Answer 1

Imr*_*ran 4

如果您正在处理 unicode，请始终将从磁盘或网络读取的响应视为字节包而不是字符串。

CSV 文件中的文本可能是 utf-8 编码的，应首先对其进行解码。

import codecs
# ...
content = response.read()
html = codecs.decode(content, 'utf-8')

Run Code Online (Sandbox Code Playgroud)

此外，您还需要将 unicode 文本编码为 utf-8，然后再将其写入输出文件。用于codecs.open打开输出文件，指定编码。它将透明地为您处理输出编码。

g = codecs.open('fencing_medalists.csv', 'wb', encoding='utf-8')

Run Code Online (Sandbox Code Playgroud)

并对字符串写入代码进行以下更改：

    theText = td.find(text=True)
    if theText is not None:
        g.write(unicode(theText))

Run Code Online (Sandbox Code Playgroud)

编辑： BeautifulSoup 可能会自动进行 unicode 解码，因此您可以跳过codecs.decodeon 响应。

归档时间：	13 年，4 月前
查看次数：	3475 次
最近记录：	13 年，4 月前