Apostrophe prints as â\x80\x99

Question

import requests
from bs4 import BeautifulSoup
import re

source_url = requests.get('http://www.nytimes.com/pages/business/index.html')
div_classes = {'class' :['ledeStory' , 'story']}
title_tags = ['h2','h3','h4','h5','h6']

source_text = source_url.text
soup = BeautifulSoup(source_text, 'html.parser')


stories = soup.find_all("div", div_classes)

h = []; h2 = []; h3 = []; h4 =[]

for x in range(len(stories)):

    for x2 in range(len(title_tags)):
        hold = []; hold2 = []
        hold = stories[x].find(title_tags[x2])

        if hold is not None:
            hold2 = hold.find('a')

            if hold2 is not None:
                hh = (((hold.text.strip('a'))).strip())
                h.append(hh)
                #h.append(re.sub(r'[^\x00-\x7f]',r'', ((hold.text.strip('a'))).strip()))
                #h2.append(hold2.get('href'))

    hold = []
    hold = stories[x].find('p')

    if hold is not None:
        h3.append(re.sub(r'[^\x00-\x7f]',r'',((hold.text.strip('p')).strip())))

    else:
        h3.append('None')


h4.append(h)
h4.append(h2)
h4.append(h3)
print(h4)

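Hey everyone. I've been trying to scrape some data, and I was almost done with my scraper when I noticed that the printed output replaces (') with (â\x80\x99). For example, a headline containing "China's" comes out as "Chinaâ\x80\x99s". I did some research and tried decode/encode (utf-8), to no avail; it just tells me that you can't call decode on a str. I tried re.sub(), which lets me delete the (â\x80\x99) but won't let me replace it with ('). I want to use natural language processing to interpret the data, and I'm worried that losing the apostrophes would change the meaning considerably. Any help would be appreciated; I feel like I've hit a wall with this.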

Answer 1

Jon*_*ler 4

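In ISO 8859-1 and related code sets (of which there are many), â has code point 0xE2. When you interpret the three bytes 0xE2, 0x80, 0x99 as a UTF-8 encoding, the character is U+2019, RIGHT SINGLE QUOTATION MARK (that is, ’, as distinct from the ASCII apostrophe '; you may or may not be able to spot the difference).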
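
The byte-level relationship described above can be reproduced in a few lines. This is only an illustrative sketch: the variable names are made up for the example, and it assumes the mangled text came from UTF-8 bytes being decoded as ISO 8859-1.

```python
import unicodedata

right_quote = "\u2019"                      # RIGHT SINGLE QUOTATION MARK
utf8_bytes = right_quote.encode("utf-8")    # the three bytes 0xE2, 0x80, 0x99
misread = utf8_bytes.decode("iso-8859-1")   # each byte read as its own character

print(unicodedata.name(right_quote))        # name of the intended character
print(utf8_bytes)                           # the UTF-8 encoding as a bytes literal
print(ascii(misread))                       # the three misread characters, escaped
print(unicodedata.name(misread[0]))         # 0xE2 in ISO 8859-1 is the letter â
```

The first misread character is â (code point 0xE2), followed by the two control characters \x80 and \x99, which matches the output in the question.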

I think there are several possible sources of your difficulty, any one or more of which may be the cause of your trouble:

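Your terminal is not set up to interpret UTF-8.
Your source code should use ' (U+0027, APOSTROPHE).
You are using Python 2.x rather than Python 3.x and are running into problems because of the use of Unicode (UTF-8). Against that (as Cory Madden pointed out), the code ends with print(h4), which is Python 3, so this is probably not the issue.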
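
A quick way to check the first and third possibilities is to ask Python which version is running and what encoding it uses for standard output. The helper can_encode below is a hypothetical name introduced for this sketch, not part of any library.

```python
import sys

def can_encode(text, encoding):
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# sys.stdout.encoding can be None when output is redirected in some setups,
# so fall back to ASCII as the most conservative assumption.
stdout_encoding = sys.stdout.encoding or "ascii"

print("Python major version:", sys.version_info.major)
print("stdout encoding:", stdout_encoding)
print("can print U+2019:", can_encode("\u2019", stdout_encoding))
```

If the last line reports False, the terminal setup is at least part of the problem; if it reports True, the mangling more likely happened earlier, when the page bytes were decoded.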

Changing the quote to an ASCII apostrophe is probably the simplest approach.
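
As a minimal sketch of that replacement, assuming the text has already been decoded correctly so that the headline contains a real U+2019: the names QUOTE_MAP and to_ascii_apostrophes are invented for this example, and mapping U+2018 as well is an extra assumption, not something the question requires.

```python
# Map the curly single quotes to the ASCII apostrophe (U+0027).
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'"})

def to_ascii_apostrophes(text):
    """Replace left/right single quotation marks with a plain apostrophe."""
    return text.translate(QUOTE_MAP)

headline = "China\u2019s economy"
print(to_ascii_apostrophes(headline))
```

In the question's loop this could be applied to hold.text before appending, in place of the re.sub() call that deletes every non-ASCII character.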

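On the other hand, if you are analyzing HTML from elsewhere, you may have to think about how your script is going to handle UTF-8. Using quotes from the U+20xx range of Unicode is a very common choice; maybe your scraper needs to handle it?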
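
One way a scraper might handle this is to decode the raw bytes as UTF-8 itself instead of relying on a guessed encoding. The sketch below uses a hard-coded byte string as a stand-in for the downloaded page, and it assumes the page really is UTF-8 and that the mangled text came from an ISO 8859-1 decode; neither is confirmed by the question.

```python
# Stand-in for the raw bytes of the page (what requests exposes as .content).
raw = "<h2>China\u2019s</h2>".encode("utf-8")

wrong = raw.decode("iso-8859-1")   # reproduces the mangled headline
right = raw.decode("utf-8")        # decodes the quote as a single character

# Text that was already mis-decoded can be repaired by reversing the mistake:
# re-encode with the wrong codec to recover the bytes, then decode as UTF-8.
repaired = wrong.encode("iso-8859-1").decode("utf-8")

print(ascii(wrong))
print(ascii(right))
print(repaired == right)
```

With the code in the question, the equivalent would be either setting source_url.encoding = 'utf-8' before reading source_url.text, or passing source_url.content (bytes) to BeautifulSoup so that it works out the encoding from the document.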
