Python 2.7: Problem when encoding to UTF-8

bcl*_*man 1 python encoding utf

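I have a DataFrame with a column _text that contains the text of articles. I'm trying to get the length, in words, of the article in each row of the DataFrame. Here's my attempt: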

from bs4 import BeautifulSoup
result_df['_text'] = [BeautifulSoup(text, "lxml").get_text() for text in result_df['_text']]

text_word_length = [len(str(x).split(" ")) for x in result_df['_text']]

Unfortunately, I get this error:

    ---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-8-f6c8ab83a46f> in <module>()
----> 1 text_word_length = [len(str(x).split(" ")) for x in result_df['_text']]

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 231: ordinal not in range(128)

It seems I should specify "utf-8" somewhere, I'm just not sure where...

Thanks!

Ser*_*sta 5

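I assume you are using Python 2 and that your input text contains non-ASCII characters. The problem arises at str(x): when x is a unicode string, str(x) by default amounts to x.encode('ascii'), which fails on any character outside the ASCII range.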
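To make the failure concrete, here is a minimal sketch that performs the ASCII encoding explicitly. The sample string is hypothetical; it simply contains the same u'\u2019' character named in the traceback. The explicit .encode('ascii') call stands in for what str() attempts implicitly on Python 2.

```python
# Hypothetical sample containing a right single quotation mark (u'\u2019').
text = u"It\u2019s here"

try:
    # On Python 2, str(text) attempts this same ASCII encoding implicitly.
    text.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    # The ASCII codec cannot represent u'\u2019', so the call raises.
    error = exc
```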

You have two ways to fix this:

  1. Explicitly encode the unicode string to UTF-8:

    text_word_length = [len(x.encode('utf-8').split(" ")) for x in result_df['_text']]
    
  2. Split the string as unicode, without encoding it at all:

    text_word_length = [len(x.split(u" ")) for x in result_df['_text']]
    
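As a rough check that both options give the same word counts, here is a small sketch. The texts list is a hypothetical stand-in for result_df['_text'], and the b" " separator in option 1 is an adjustment (equivalent to " " on Python 2) so the snippet also runs on Python 3.

```python
# Hypothetical sample standing in for result_df['_text'].
texts = [u"It\u2019s a short article", u"plain ascii text"]

# Option 1: encode to UTF-8 bytes, then split on a byte-string space.
lengths_utf8 = [len(t.encode('utf-8').split(b" ")) for t in texts]

# Option 2: stay in unicode and split on a unicode space.
lengths_unicode = [len(t.split(u" ")) for t in texts]

# Both approaches count the same space-separated words.
same_counts = lengths_utf8 == lengths_unicode
```

Option 2 avoids the round trip through bytes, which is usually the simpler choice when the text is only being counted rather than written out.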