Python 2.7: Trouble encoding to UTF-8

Question

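I have a dataframe with a column, _text, that contains the text of an article. I am trying to get the article length (in words) for each row of the dataframe. Here is my attempt: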

```
from bs4 import BeautifulSoup
result_df['_text'] = [BeautifulSoup(text, "lxml").get_text() for text in result_df['_text']]

text_word_length = [len(str(x).split(" ")) for x in result_df['_text']]
```


Unfortunately, I get this error:

```
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-8-f6c8ab83a46f> in <module>()
----> 1 text_word_length = [len(str(x).split(" ")) for x in result_df['_text']]

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 231: ordinal not in range(128)
```


It seems I should be specifying "utf-8" somewhere, I am just not sure where...

Thanks!

Answer 1

Ser*_*sta 5

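I assume you are using Python 2 and that your input text contains non-ASCII characters. The problem arises at str(x): when x is a unicode string, str(x) by default amounts to x.encode('ascii'), which fails on any character outside the ASCII range, such as u'\u2019' (the right single quotation mark).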
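To make the failure concrete, here is a minimal sketch that performs the ASCII encode explicitly. The `sample` string is a hypothetical stand-in for one article, not data from the question. Spelling out the encode call keeps the snippet behaving the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text; u'\u2019' is the character named in the traceback.
sample = u"it\u2019s a test"

try:
    # On Python 2, str(sample) performs this ASCII encode implicitly.
    sample.encode('ascii')
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Shows the "'ascii' codec can't encode character ..." message.
print(error_message)

# Encoding explicitly as UTF-8 succeeds and yields a byte string.
encoded = sample.encode('utf-8')
print(repr(encoded))
```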

You have two ways to solve this:

Explicitly encode the unicode strings as UTF-8:
```
text_word_length = [len(x.encode('utf-8').split(" ")) for x in result_df['_text']]
```
Split the strings as unicode, with no encoding step at all:
```
text_word_length = [len(x.split(u" ")) for x in result_df['_text']]
```
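As a rough check that both options give the same counts, the sketch below runs them on a plain list that stands in for result_df['_text'] (the sample strings and the `count_words` helper are assumptions for illustration, not part of the question). It also shows a variation the answer does not mention: calling split() with no argument splits on any run of whitespace, so repeated spaces do not inflate the count.

```python
# -*- coding: utf-8 -*-
def count_words(text):
    """Hypothetical helper: count whitespace-separated tokens in a unicode string."""
    return len(text.split())

# Stand-in for result_df['_text']; the first string contains a run of three spaces.
texts = [u"it\u2019s a   short text", u"caf\u00e9 au lait"]

# Option 1 from the answer: encode to UTF-8 bytes, then split on a byte space.
by_bytes = [len(t.encode('utf-8').split(b" ")) for t in texts]

# Option 2 from the answer: split the unicode string directly.
by_space = [len(t.split(u" ")) for t in texts]

# Variation (assumption): split on any whitespace run instead of single spaces.
by_whitespace = [count_words(t) for t in texts]

print(by_bytes)
print(by_space)
print(by_whitespace)
```

Splitting on a single space counts the empty strings between consecutive spaces as words, so which variant is appropriate depends on how clean the extracted text is.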
