Output an ASCII file from a Unicode web scrape in Python

cas*_*ova 3 python unicode

I am new to Python programming. I am using the following code in my Python file:

import gethtml
import articletext
url = "http://www.thehindu.com/news/national/india-calls-for-resultoriented-steps-at-asem/article5339414.ece"
result = articletext.getArticle(url)
text_file = open("Output.txt", "w")

text_file.write(result)

text_file.close()

The file articletext.py contains the following code:

from bs4 import BeautifulSoup
import gethtml
def getArticleText(webtext):
    articletext = ""
    soup = BeautifulSoup(webtext)
    for tag in soup.findAll('p'):
        articletext += tag.contents[0]
    return articletext

def getArticle(url):
    htmltext = gethtml.getHtmlText(url)
    return getArticleText(htmltext)

But I am getting the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 473: ordinal not in range(128)
What code should I write to save the result to the output file?

The output `result` is text in the form of a paragraph.

Aar*_*all 5

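To handle the Unicode error, we need to encode the text as UTF-8 rather than ASCII. To make sure the encoding step never raises, we pass 'ignore' so that any character without a mapping is dropped. (You could also use 'replace' or one of the other error-handling options that str.encode accepts. See the Python documentation on Unicode.)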
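As a rough illustration of how these error handlers differ, here is a minimal sketch. The sample string is made up for the example (it is not the scraped article); it only contains the same curly-quote character, u'\u201c', that appears in the error message.

```python
# Hypothetical sample text containing the curly quotes that break ASCII encoding.
text = u"\u201cquoted\u201d text"

# Strict ASCII encoding raises UnicodeEncodeError, as in the question.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' silently drops characters the codec cannot represent.
ignored = text.encode("ascii", "ignore")

# 'replace' substitutes a question mark for each unencodable character.
replaced = text.encode("ascii", "replace")

# UTF-8 can represent these characters, so nothing is lost.
utf8 = text.encode("utf-8")
```

With the ASCII codec, 'ignore' and 'replace' lose the quotes, whereas UTF-8 keeps them and decodes back to the original string.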

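The best practice for opening a file is to use a Python context manager, which closes the file even if an error occurs. I use forward slashes rather than backslashes in the path so that it works on both Windows and Unix/Linux.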

text = text.encode('UTF-8', 'ignore')
with open('/temp/Out.txt', 'w') as file:
    file.write(text)

This is equivalent to

text = text.encode('UTF-8', 'ignore')
# Open before the try block: if open() itself fails, there is no file to close.
file = open('/temp/Out.txt', 'w')
try:
    file.write(text)
finally:
    file.close()

But the context manager is much less verbose, and it is far less likely to leave you with a locked file after an error.
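An alternative worth considering, which is not part of the answer above, is to let the file object do the encoding. The sketch below assumes io.open with an explicit encoding, which is available in both Python 2.6+ and Python 3. The sample text and the temporary output path are placeholders for the example, not the scraped article or the path used above.

```python
import io
import os
import tempfile

# Hypothetical sample text standing in for the scraped article.
text = u"He said \u201chello\u201d."

# Hypothetical output location; a temp directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "Output.txt")

# io.open encodes on write, so no manual .encode() call is needed.
with io.open(path, "w", encoding="utf-8") as out_file:
    out_file.write(text)

# Read the file back with the same encoding to check the round trip.
with io.open(path, "r", encoding="utf-8") as in_file:
    round_tripped = in_file.read()
```

Because the file is opened in text mode with an encoding, it expects Unicode text rather than already-encoded bytes, so the text should not be passed through .encode() first.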