In other words, can the regex /<tag[^>]*>.*?<\/tag>/ be used to match HTML tag elements that contain no nested tag elements?
For example (lt.html):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>greater than sign in attribute value</title>
</head>
<body>
<div>1</div>
<div title=">">2</div>
</body>
</html>
The regular expression:
$ perl -nE'say $1 if m~<div[^>]*>(.*?)</div>~' lt.html
And the screen scraper:
#!/usr/bin/env python
import sys
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(sys.stdin)
for div in soup.findAll('div'):
    print div.string
$ python lt.py <lt.html
Both give the same output:
1
">2
Expected output:
1
2
The W3C says:

Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.
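This is exactly why the regex fails on `<div title=">">`: a real HTML tokenizer knows when it is inside a quoted attribute value, while `[^>]*` does not. A minimal sketch of a parser-based extractor using only the standard library's `html.parser` (the class and variable names here are illustrative, not from the question):

```python
from html.parser import HTMLParser

class DivText(HTMLParser):
    """Collect the text content of each <div>, tracking nesting depth."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # The tokenizer has already consumed the quoted attribute value,
        # so a '>' inside title=">" never confuses it.
        if tag == 'div':
            self.depth += 1
            self.texts.append('')

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        if tag == 'div':
            self.depth -= 1

parser = DivText()
parser.feed('<body><div>1</div><div title=">">2</div></body>')
print(parser.texts)  # ['1', '2']
```

Unlike the regex, this yields the expected output `1` and `2` for lt.html.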
I'm using Scrapy to try to get some data I need from Google Scholar. Take the following link as an example: http://scholar.google.com/scholar?q=intitle%3Apython+xpath

Now I want to scrape all the titles from this page. The process I followed is as follows:
scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
This gives me the Scrapy shell, in which I run:
>>> sel.xpath('//h3[@class="gs_rt"]/a').extract()
[
u'<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.4438&rep=rep1&type=pdf"><b>Python </b>Paradigms for XML</a>',
u'<a href="https://svn.eecs.jacobs-university.de/svn/eecs/archive/bsc-2009/sbhushan.pdf">NCClient: A <b>Python </b>Library for NETCONF Clients</a>',
u'<a href="http://hal.archives-ouvertes.fr/hal-00759589/">PALSE: <b>Python </b>Analysis of Large Scale (Computer) Experiments</a>',
u'<a href="http://i.iinfo.cz/r2/kd/xmlprague2007.pdf#page=53"><b>Python </b>and XML</a>',
u'<a href="http://www.loadaveragezero.com/app/drx/Programming/Languages/Python/">drx: <b>Python </b>Programming Language [Computers: Programming: Languages: <b>Python</b>]-loadaverageZero</a>',
u'<a href="http://www.worldcolleges.info/sites/default/files/py10.pdf">XML and <b>Python </b>Tutorial</a>',
u'<a href="http://dl.acm.org/citation.cfm?id=2555791">Zato\u2014agile ESB, SOA, REST and cloud integrations in <b>Python</b></a>',
u'<a href="ftp://ftp.sybex.com/4021/4021index.pdf">XML Processing with Perl, <b>Python</b>, and PHP</a>',
u'<a href="http://books.google.com/books?hl=en&lr=&id=El4TAgAAQBAJ&oi=fnd&pg=PT8&dq=python+xpath&ots=RrFv0f_Y6V&sig=tSXzPJXbDi6KYnuuXEDnZCI7rDA"><b>Python </b>& XML</a>',
u'<a href="https://code.grnet.gr/projects/ncclient/repository/revisions/efed7d4cd5ac60cbb7c1c38646a6d6dfb711acc9/raw/docs/proposal.pdf">A <b>Python </b>Module for NETCONF …

I have downloaded a web page into an HTML file. I'd like to know the simplest way to get the content of that page. By content, I mean the strings a browser would display.
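Going back to the Scholar results above: the extracted strings keep the `<a>` and `<b>` markup because the XPath selects elements, not text. A common fix in Scrapy is to select text nodes instead, e.g. `sel.xpath('//h3[@class="gs_rt"]/a//text()').extract()`. The idea of collecting all text nodes under an element can be sketched with only the standard library (the snippet below is a hypothetical stand-in for one result, not fetched from Scholar):

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one <a> result from the list above.
snippet = '<a href="http://example.invalid"><b>Python </b>Paradigms for XML</a>'
elem = ET.fromstring(snippet)

# itertext() walks every nested text node, like XPath .//text()
title = ''.join(elem.itertext())
print(title)  # Python Paradigms for XML
```

The `<b>` tag disappears because only its text and tail are collected, which is what the plain titles need.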
To be clear:

Input:
<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Output:
Page title This is paragraph one. This is paragraph two.
Putting it together:
from BeautifulSoup import BeautifulSoup
import re

def removeHtmlTags(page):
    # Regex approach: strip anything that looks like a tag.
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # Parser approach: keep only the text nodes.
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))
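The code above uses the old BeautifulSoup 3 API (in bs4 the equivalent is roughly `soup.get_text()`). If a third-party dependency is undesirable, the same text extraction can be sketched with the standard library's `html.parser`; the class name here is illustrative:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of a page, like soup.findAll(text=True)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

page = """<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>"""

extractor = TextExtractor()
extractor.feed(page)
# Collapse runs of whitespace, roughly as a browser does when rendering.
text = ' '.join(''.join(extractor.chunks).split())
print(text)  # Page title This is paragraph one. This is paragraph two.
```

On the sample input this produces exactly the expected output shown above.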