What is a good, reliable, short way to get the charset of a web page?

Question

I'm a bit surprised that getting the charset of a web page with Python is so complicated. Am I missing a way? HTTPMessage has plenty of functions, but not this one.

>>> import urllib2
>>> google = urllib2.urlopen('http://www.google.com/')
>>> google.headers.gettype()
'text/html'
>>> google.headers.getencoding()
'7bit'
>>> google.headers.getcharset()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: HTTPMessage instance has no attribute 'getcharset'

So you have to get the header and split it. Twice.

>>> google = urllib2.urlopen('http://www.google.com/')
>>> charset = 'ISO-8859-1'
>>> contenttype = google.headers.getheader('Content-Type', '')
>>> if ';' in contenttype:
...     charset = contenttype.split(';')[1].split('=')[1]
>>> charset
'ISO-8859-1'

That is a surprising number of steps for such a basic function. Am I missing something?

Answer 1

Len*_*rri 6

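Have you checked this?

How to download any(!) web page with the correct charset in Python?

So I was missing something, namely `.headers.getparam('charset')`, which simplifies things a lot. (2 upvotes)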
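
As a rough illustration of why the parameter lookup is simpler than splitting by hand, the sketch below compares the two on a sample header. It is written for Python 3, where the headers class is built on `email.message.Message` and the method is spelled `get_param`. The helper names `naive_charset` and `param_charset` and the sample header value are made up for this example.

```python
from email.message import Message


def naive_charset(content_type, default="ISO-8859-1"):
    """The split-twice approach from the question."""
    if ";" in content_type:
        return content_type.split(";")[1].split("=")[1]
    return default


def param_charset(content_type, default="ISO-8859-1"):
    """Look up the charset parameter by name instead of by position."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset", failobj=default)


# A header with an extra parameter before charset, and a quoted value.
header = 'text/html; version=1; charset="utf-8"'

naive = naive_charset(header)    # takes the value of the first parameter
robust = param_charset(header)   # finds charset by name and strips the quotes

print(naive)
print(robust)
```

The manual split depends on charset being the first parameter and on the value being unquoted. The lookup by name does not.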

Answer 2

Eli*_*ria 5

I did some research and came up with this solution:

import urllib.request
response = urllib.request.urlopen(url)  # url: address of the page to fetch
encoding = response.headers.get_content_charset()

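This is how I would do it in Python 3. I haven't tested it in Python 2, where the module is urllib2 (urllib2.urlopen, as in the question) and the headers object is a different class, so get_content_charset is probably not available there.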

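Here is how it works, since the official Python documentation doesn't explain it very well: the result of urlopen is an http.client.HTTPResponse object. The headers attribute of this object is an http.client.HTTPMessage object which, according to the documentation, "is implemented using the email.message.Message class". That class has a method called get_content_charset, which tries to determine and return the charset of the response.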
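
The class relationship described above can be checked without making a network request. This is only a minimal sketch: the headers object is built by hand with a sample Content-Type value, rather than taken from a real response.

```python
import email.message
import http.client

# HTTPMessage inherits its header-parsing methods from email.message.Message.
is_subclass = issubclass(http.client.HTTPMessage, email.message.Message)

# Build a headers object by hand (a real one would come from response.headers).
headers = http.client.HTTPMessage()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
declared = headers.get_content_charset()

# With no charset parameter, the method falls back to None.
bare = http.client.HTTPMessage()
bare["Content-Type"] = "text/html"
missing = bare.get_content_charset()

print(is_subclass, declared, missing)
```

Note that get_content_charset returns the charset name in lowercase, so the value may not match the header's spelling exactly.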

By default, this method returns None if the charset cannot be determined, but you can override this behavior by passing the failobj parameter:

encoding = response.headers.get_content_charset(failobj="utf-8")

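One possible use of the fallback is decoding the response body in a single step. The sketch below assumes UTF-8 is an acceptable default. The helper name `decode_body` is hypothetical, and the headers are built by hand so the example runs offline.

```python
from email.message import Message


def decode_body(raw_bytes, headers, default="utf-8"):
    """Decode raw_bytes using the charset declared in headers, else default."""
    encoding = headers.get_content_charset(failobj=default)
    # errors="replace" avoids an exception if the declared charset is wrong.
    return raw_bytes.decode(encoding, errors="replace")


# Header that declares a charset.
latin1_headers = Message()
latin1_headers["Content-Type"] = "text/html; charset=ISO-8859-1"
latin1_text = decode_body("café".encode("iso-8859-1"), latin1_headers)

# Header without a charset: the default is used.
bare_headers = Message()
bare_headers["Content-Type"] = "text/html"
utf8_text = decode_body("café".encode("utf-8"), bare_headers)

print(latin1_text, utf8_text)
```

With a real response this would be called as `decode_body(response.read(), response.headers)`. A charset name that Python does not recognize would still raise LookupError, so a caller may want to catch that.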
