Related questions and solutions (0)

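TypeError: can't use a string pattern on a bytes-like object in re.findall()

I am trying to learn how to automatically fetch URLs from a page. In the following code I am trying to get the title of the web page: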

import urllib.request
import re

url = "http://www.google.com"
regex = r'<title>(,+?)</title>'
pattern  = re.compile(regex)

with urllib.request.urlopen(url) as response:
   html = response.read()

title = re.findall(pattern, html)
print(title)


I got this unexpected error:

Traceback (most recent call last):
  File "path\to\file\Crawler.py", line 11, in <module>
    title = re.findall(pattern, html)
  File "C:\Python33\lib\re.py", line 201, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object


What am I doing wrong?
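
The traceback points at a type mismatch: `response.read()` returns `bytes`, while the compiled pattern is a `str`. A minimal sketch of two ways around it follows, assuming the page is UTF-8 encoded (an assumption; the real encoding should come from the response headers). The helper name `extract_titles` and the sample bytes are hypothetical, and the sketch uses `(.+?)` on the assumption that `(,+?)` in the question is a typo.

```python
import re

# str pattern: the input must be decoded to str before matching.
TITLE_RE = re.compile(r"<title>(.+?)</title>", re.IGNORECASE | re.DOTALL)


def extract_titles(html_bytes, encoding="utf-8"):
    """Decode the raw response body, then search it with a str pattern."""
    text = html_bytes.decode(encoding, errors="replace")
    return TITLE_RE.findall(text)


# Stand-in for response.read(), so the sketch runs without network access.
sample = b"<html><head><title>Google</title></head></html>"

print(extract_titles(sample))

# Alternative: keep the data as bytes and use a bytes pattern (rb"...").
print(re.findall(rb"<title>(.+?)</title>", sample))
```

With a live request, the same idea would be `html = response.read().decode("utf-8")` before calling `re.findall`.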

python web-crawler python-3.x

Ins*_*lue

2019 04-09

Score: 90 | Answers: 2 | Views: 120k

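Launch default editor (like the 'webbrowser' module)

Is there a simple way to launch the system's default editor from a Python command-line tool, like the webbrowser module does for browsers?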
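
As far as I know, the standard library has no direct editor counterpart to `webbrowser`, so one common convention is to honor the `VISUAL` and `EDITOR` environment variables and fall back to a platform default. The sketch below assumes that convention; the function names `default_editor_command` and `edit_file` and the fallback editors (`notepad`, `vi`) are hypothetical choices, not an established API.

```python
import os
import shlex
import subprocess
import sys


def default_editor_command(environ=None):
    """Return the editor command as an argument list.

    Checks VISUAL, then EDITOR; falls back to a guessed platform default.
    """
    env = os.environ if environ is None else environ
    for var in ("VISUAL", "EDITOR"):
        value = env.get(var)
        if value:
            # The variable may carry arguments, e.g. "code --wait".
            return shlex.split(value, posix=(os.name != "nt"))
    # Assumed fallbacks; adjust for the target systems.
    return ["notepad"] if sys.platform.startswith("win") else ["vi"]


def edit_file(path):
    """Open path in the chosen editor and wait for it to exit."""
    return subprocess.call(default_editor_command() + [path])
```

Calling `edit_file("notes.txt")` blocks until the editor process exits and returns its exit code, which suits command-line tools that need the edited content afterwards.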

python command-line editor

pki*_*kit

lucky-day

Score: 13 | Answers: 1 | Views: 7778

urllib.request.urlopen returns bytes, but I cannot decode it

I tried parsing a web page using the urlopen() function from urllib.request, like this:

from urllib.request import Request, urlopen
req = Request(url)
html = urlopen(req).read()

However, the last line returns the result as bytes.

So I tried decoding it, like this:

html = urlopen(req).read().decode("utf-8")

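However, an error occurred:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

After some research, I found a related answer that parses the charset to decide how to decode. However, the page does not return a charset, and when I checked it in the Chrome Web Inspector, the following line was written in its header: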

<meta charset="utf-8">

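So why can't I decode it with utf-8? And how can I parse the web page successfully?

The website URL is http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi/slideshow/collection#2, and I want to save the images to my disk.

Note that I am using Python 3.5.1. I also noticed that everything I wrote above has worked fine in my other scraping programs.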
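
One plausible reading of the error: gzip streams begin with the bytes 0x1f 0x8b, so a 0x8b at position 1 suggests the server sent a gzip-compressed body, which would have to be decompressed before decoding. The sketch below illustrates that under this assumption; `decode_body` is a hypothetical helper, and the sample data is generated locally rather than fetched from the site.

```python
import gzip

# The first two bytes of any gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw, encoding="utf-8"):
    """Decompress the body if it looks gzip-compressed, then decode it."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Stand-in for urlopen(req).read() on a gzip-compressed response.
sample = gzip.compress('<meta charset="utf-8">'.encode("utf-8"))

print(sample[1] == 0x8B)    # the byte the error message complains about
print(decode_body(sample))
```

With a live response, checking the `Content-Encoding` header (for example via `response.headers.get("Content-Encoding")`) is a more direct signal than sniffing the magic bytes.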

python decode urllib urlopen python-3.x

Bla*_*ard

2017 05-23

Score: 3 | Answers: 1 | Views: 3399

Tag statistics

python ×3

python-3.x ×2

command-line ×1

decode ×1

editor ×1

urllib ×1

urlopen ×1

web-crawler ×1
