re.findall in Python 3

Question

I want to use re.findall() to search a web page for a pattern:

from urllib.request import Request, urlopen
import re


url = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/20.0.1'})
webpage = urlopen(url).read()

findrows = re.compile('<td class="cmeTableCenter">(.*)</td>')
row_array = re.findall(findrows, webpage) #ERROR HERE

I get an error:

TypeError: can't use a string pattern on a bytes-like object

Answer 1

Cai*_*von 5

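urllib.request.urlopen(...).read() returns a bytes object, not a (Unicode) string. You should decode it before trying to match anything against it. For example, if you know your page is UTF-8: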

webpage = urlopen(url).read().decode('utf8')

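Better HTTP libraries will do this for you automatically, but determining the correct encoding is not always trivial, or even possible, so Python's standard library does not attempt it.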
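When the server does declare a charset in its Content-Type header, one possible approach is to read it from the response headers and fall back to a guess otherwise. The sketch below is only an illustration: `decode_body` is a hypothetical helper, the UTF-8 fallback is an assumption, and the headers and body are made-up sample data so that it runs without a network connection. The non-greedy `(.*?)` is also a choice made here, not part of the original pattern.

```python
import re
from email.message import Message


def decode_body(raw, headers, fallback="utf-8"):
    """Decode raw bytes using the charset declared in the headers.

    `fallback` is an assumption: it is used only when the headers
    declare no charset, and it may be wrong for a given page.
    """
    charset = headers.get_content_charset() or fallback
    # errors="replace" avoids a UnicodeDecodeError on stray bytes.
    return raw.decode(charset, errors="replace")


# Made-up sample data standing in for a real HTTP response.
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"
raw = b'<td class="cmeTableCenter">Caf\xc3\xa9</td>'

webpage = decode_body(raw, headers)

# A str pattern now matches, because webpage is a str.
findrows = re.compile(r'<td class="cmeTableCenter">(.*?)</td>')
row_array = findrows.findall(webpage)
print(row_array)
```

With a live request, the object returned by `urlopen(url)` exposes the same kind of message object as `response.headers`, so `response.headers.get_content_charset()` could be passed through in the same way.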

Another option is to use a bytes regular expression:

findrows = re.compile(b'<td class="cmeTableCenter">(.*)</td>')

This is useful if you do not know the encoding and do not mind working with bytes objects throughout your code.
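As a rough illustration of the bytes approach, the sketch below runs both kinds of pattern against a made-up bytes sample (not the real page). The sample data and the non-greedy `(.*?)` are assumptions added here for the example.

```python
import re

# Made-up sample bytes; the second cell holds a byte that is not valid UTF-8.
raw = (b'<tr><td class="cmeTableCenter">CL</td>'
       b'<td class="cmeTableCenter">\xe9</td></tr>')

# A str pattern against bytes raises the TypeError from the question.
try:
    re.findall('<td class="cmeTableCenter">(.*?)</td>', raw)
except TypeError as exc:
    error_message = str(exc)

# A bytes pattern works without knowing the encoding.
findrows = re.compile(rb'<td class="cmeTableCenter">(.*?)</td>')
row_array = findrows.findall(raw)

# Every match is itself a bytes object, so later code must handle bytes.
print(row_array)
```

The matches stay undecoded, which is the trade-off mentioned above: nothing fails on an unknown encoding, but any text processing afterwards has to work with bytes.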
