Getting the content of a web page's body with Python

Question

I am trying to scan various websites with Python. The following code works fine for me.

import urllib
import re

# Python 2: urllib.urlopen and the print statement
htmlfile = urllib.urlopen("http://google.com")
htmltext = htmlfile.read()
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
title = re.findall(pattern, htmltext)
print title

To get the body content, I changed it as follows:

import urllib
import re

htmlfile = urllib.urlopen("http://google.com")
htmltext = htmlfile.read()
regex = '<body>(.+?)</body>'
pattern = re.compile(regex)
title = re.findall(pattern, htmltext)
print title

The code above gives me empty square brackets (an empty list). I don't know what I am doing wrong. Please help.

Answer 1

rec*_*gle 6

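In general, trying to parse HTML with regular expressions is a bad idea.

The excellent Beautiful Soup library makes what you are trying to do trivial.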

import bs4

html = '''
<head>
</head>
<body>
  <div></div>
</body>
'''

# Name the parser explicitly; omitting it makes bs4 emit a warning.
print(bs4.BeautifulSoup(html, 'html.parser').find('body'))

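Python also has an HTML parser in its standard library (html.parser in Python 3, HTMLParser in Python 2). It is essentially a less featureful version of what Beautiful Soup offers.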
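
As a rough illustration of the standard-library route, here is a minimal Python 3 sketch built on html.parser.HTMLParser. The class name BodyTextExtractor, the helper body_text, and the sample markup are hypothetical names chosen for this example; they are not part of the original answer.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect the text that appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False   # True while we are inside the <body> element
        self.chunks = []       # text fragments found inside the body

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes on <body> are ignored.
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep only non-blank text that appears inside the body.
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


def body_text(html_source):
    """Return the text fragments of the body as a list of strings."""
    parser = BodyTextExtractor()
    parser.feed(html_source)
    parser.close()
    return parser.chunks


# Hypothetical sample page used only for this demonstration.
sample = """
<html>
<head><title>Example</title></head>
<body class="home">
  <div>Hello</div>
  <p>World</p>
</body>
</html>
"""

print(body_text(sample))
```

Because the parser works on tags rather than raw text, attributes on the body tag and line breaks inside it do not matter. Note that this sketch also collects the contents of script and style elements if they appear inside the body.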

If you still insist on using a regular expression, this should work:

import re
# re.DOTALL lets '.' match newlines, so a body spanning several lines is captured.
print(re.findall('<body>(.*?)</body>', html, re.DOTALL))

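Also, this may sound silly, but make sure the htmltext string actually contains a body tag. Many pages write the tag with attributes (for example <body class="...">), and a pattern that requires a literal <body> will not match those.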
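
One way to run that check, and to see why the pattern from the question can come back empty, is sketched below. The names STRICT, TOLERANT, has_body_tag, find_body, and the sample page string are assumptions made for this illustration. The more tolerant pattern is a suggestion of this sketch, not something the answer prescribes, and it is still fragile (for example, it breaks if an attribute value contains ">").

```python
import re

# Pattern from the question: it needs a literal "<body>" and, without
# re.DOTALL, '.' cannot cross a newline.
STRICT = re.compile(r"<body>(.+?)</body>")

# More tolerant variant (an assumption of this sketch): allows attributes
# on the tag, any letter case, and content that spans several lines.
TOLERANT = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.DOTALL | re.IGNORECASE)


def has_body_tag(htmltext):
    """Return True if the text contains an opening body tag of any form."""
    return re.search(r"<body\b", htmltext, re.IGNORECASE) is not None


def find_body(htmltext):
    """Return the inner HTML of each body element matched by TOLERANT."""
    return TOLERANT.findall(htmltext)


# Hypothetical page: uppercase tag, an attribute, and line breaks.
page = '<html><head></head>\n<BODY bgcolor="#fff">\n<p>Hi</p>\n</BODY></html>'

print(has_body_tag(page))    # is there a body tag at all?
print(STRICT.findall(page))  # the question's pattern on this page
print(find_body(page))       # the tolerant pattern on the same page
```

In Python 3, urllib.request.urlopen(...).read() returns bytes, so the result has to be decoded to str before it is passed to these string patterns.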
