How to get the content of an HTML page in Python

Question

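I have downloaded a web page to an HTML file. What is the simplest way to get the content of that page? By content, I mean the strings that a browser would display.

To be clear:

Input: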

<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>

Output:

Page title This is paragraph one. This is paragraph two.

Putting it together:

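# Beautiful Soup 4 import; the original post used the legacy
# BeautifulSoup 3 form: from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import re

def removeHtmlTags(page):
    # Regex approach: delete anything that looks like a tag,
    # allowing for quoted attribute values that contain '>'.
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # Parser approach: collect every text node in the document.
    soup = BeautifulSoup(page, 'html.parser')
    return ''.join(soup.find_all(string=True))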

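Related

Python HTML removal
Extracting text from an HTML file using Python
What is a light Python library that can eliminate HTML tags? (and keep only the text)
Remove HTML tags in AppEngine Python Env (equivalent to Ruby's Sanitize)
RegEx match open tags except XHTML self-contained tags (the famous "don't parse HTML with regex" rant)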

Answer 1

Odd*_*ing 12

Use Beautiful Soup to parse the HTML.

To get all of the text, without the tags, try:

''.join(soup.findAll(text=True))

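A minimal sketch of how that one-liner might fit into a complete script, assuming Beautiful Soup 4 (the `bs4` package) with Python's built-in `html.parser` backend; the answer does not say which version or parser it had in mind. The `PAGE` constant and the `visible_text` helper are illustrative names, not part of the original answer.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# The sample input from the question (illustrative constant name).
PAGE = """<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>"""

def visible_text(page):
    """Hypothetical helper: return the page's text with the tags removed."""
    soup = BeautifulSoup(page, "html.parser")
    # get_text() concatenates the document's text nodes; for a page like
    # this one it gives much the same result as joining findAll(text=True).
    raw = soup.get_text()
    # Collapse runs of whitespace so the result reads as a single line.
    return " ".join(raw.split())

if __name__ == "__main__":
    print(visible_text(PAGE))
```

For a page saved to disk, the file's contents can be read into a string and passed to the helper in place of `PAGE`.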

Answer 2

the*_*Man 8

Personally, I use lxml because it's a Swiss Army knife...

from lxml import html

# print() call form, so this also runs on Python 3
print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())

This tells lxml to retrieve the page, find the <body> tag, then extract and print all of its text.
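Since the question starts from a page that has already been downloaded, the same idea can be sketched against a string instead of a URL. This assumes lxml is installed; `PAGE` and `body_text` are illustrative names. Note that `//body` leaves out the title, which lives in `<head>`.

```python
from lxml import html  # assumption: lxml is installed

# The sample input from the question (illustrative constant name).
PAGE = """<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>"""

def body_text(page):
    """Hypothetical helper: text of the <body> element, whitespace-normalized."""
    doc = html.fromstring(page)          # parse from a string, no network access
    body = doc.xpath("//body")[0]        # first (and only) <body> element
    # text_content() returns the element's text including all descendants.
    return " ".join(body.text_content().split())

if __name__ == "__main__":
    print(body_text(PAGE))
```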

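I do a lot of page parsing, and most of the time a regular expression is the wrong solution unless it is a one-off need. If the page's authors change their HTML, there is a good chance your regex will break. A parser is much more likely to keep working.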
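A parser-based approach does not necessarily require a third-party library. As a rough sketch, not something proposed in this thread, the standard library's `html.parser.HTMLParser` can collect text nodes; the `TextExtractor` class and `html_to_text` function below are hypothetical names, and skipping `<script>` and `<style>` content is a design choice made here for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.chunks.append(data)

def html_to_text(page):
    """Return the text of `page` with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join("".join(parser.chunks).split())

if __name__ == "__main__":
    sample = "<p>Hello <b>world</b>.</p><script>var x = 1;</script>"
    print(html_to_text(sample))
```

Unlike a tag-stripping regex, this leaves the decision about where tags begin and end to the parser, so markup changes are less likely to need a rewritten pattern.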

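The big hurdle with a parser is learning how to reach the parts of the document you are after, but there are many XPath tools you can use inside your browser to simplify that task.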
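To illustrate what reaching a specific part of the document can look like, here is a small sketch that runs two XPath expressions against the question's sample HTML. It assumes lxml is installed; the variable names are illustrative, and the expressions are examples chosen for this page rather than anything given in the answer.

```python
from lxml import html  # assumption: lxml is installed

# The sample input from the question (illustrative constant name).
PAGE = """<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>"""

doc = html.fromstring(PAGE)

# Text nodes directly inside the <title> element (returned as a list).
title = doc.xpath("//title/text()")

# The <p> element whose id attribute is "secondpara".
second = doc.xpath('//p[@id="secondpara"]')[0]
# Include text from nested tags such as <b>, then collapse whitespace.
second_text = " ".join(second.text_content().split())

if __name__ == "__main__":
    print(title)
    print(second_text)
```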
