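Is ">" (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?

In other words, can the regular expression /<tag[^>]*>.*?<\/tag>/ be used to match an html element `tag` that contains no nested `tag` elements?

For example (lt.html):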

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <title>greater than sign in attribute value</title>
  </head>
  <body>
    <div>1</div>
    <div title=">">2</div>
  </body>
</html>

The regex:

$ perl -nE"say $1 if m~<div[^>]*>(.*?)</div>~" lt.html

And the screen scraper (lt.py):

#!/usr/bin/env python
import sys
import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(sys.stdin)
for div in soup.findAll('div'):
    print div.string


$ python lt.py <lt.html

Both give the same output:

1
">2

Expected output:

1
2

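The W3C says:

Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.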
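To make the difference concrete, here is a minimal sketch in Python 3 using only the standard library. It is not code from the original post, and the names `NAIVE`, `QUOTE_AWARE`, `DivText`, and `div_texts` are made up for this illustration. It compares the naive pattern from the question with a pattern that skips over quoted attribute values, and with the standard library's `html.parser`, on the two `div` elements from lt.html.

```python
import re
from html.parser import HTMLParser

# The two div elements from lt.html.
HTML = '<div>1</div>\n<div title=">">2</div>\n'

# Pattern from the question: the start tag ends at the first ">" seen,
# even if that ">" sits inside a quoted attribute value.
NAIVE = re.compile(r'<div[^>]*>(.*?)</div>')

# Illustrative alternative: consume quoted strings as a unit so that a
# ">" inside quotes does not end the start tag.
QUOTE_AWARE = re.compile(r'''<div(?:"[^"]*"|'[^']*'|[^'">])*>(.*?)</div>''')


class DivText(HTMLParser):
    """Collect the text found inside each div element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many div elements are currently open
        self.texts = []   # one string per div start tag

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.depth += 1
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def div_texts(html):
    parser = DivText()
    parser.feed(html)
    parser.close()
    return parser.texts


print(NAIVE.findall(HTML))        # naive regex
print(QUOTE_AWARE.findall(HTML))  # quote-aware regex
print(div_texts(HTML))            # HTML parser
```

The quote-aware pattern is still only a sketch: it assumes well-formed quoting and no nested `div` elements, so a parser is usually the more robust choice.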

html regex syntax

8
votes
1
answer
3169
views

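Using Beautiful Soup to clean up HTML in scrapy

I'm using scrapy to try to scrape some data I need from Google Scholar. Take the following link as an example: http://scholar.google.com/scholar?q=intitle%3Apython+xpath

Now, I'd like to scrape all the titles from this page. The process I'm following is: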

scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"

This gives me the scrapy shell, in which I run:

>>> sel.xpath('//h3[@class="gs_rt"]/a').extract()

[
 u'<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.4438&amp;rep=rep1&amp;type=pdf"><b>Python </b>Paradigms for XML</a>', 
 u'<a href="https://svn.eecs.jacobs-university.de/svn/eecs/archive/bsc-2009/sbhushan.pdf">NCClient: A <b>Python </b>Library for NETCONF Clients</a>', 
 u'<a href="http://hal.archives-ouvertes.fr/hal-00759589/">PALSE: <b>Python </b>Analysis of Large Scale (Computer) Experiments</a>', 
 u'<a href="http://i.iinfo.cz/r2/kd/xmlprague2007.pdf#page=53"><b>Python </b>and XML</a>', 
 u'<a href="http://www.loadaveragezero.com/app/drx/Programming/Languages/Python/">drx: <b>Python </b>Programming Language [Computers: Programming: Languages: <b>Python</b>]-loadaverageZero</a>', 
 u'<a href="http://www.worldcolleges.info/sites/default/files/py10.pdf">XML and <b>Python </b>Tutorial</a>', 
 u'<a href="http://dl.acm.org/citation.cfm?id=2555791">Zato\u2014agile ESB, SOA, REST and cloud integrations in <b>Python</b></a>', 
 u'<a href="ftp://ftp.sybex.com/4021/4021index.pdf">XML Processing with Perl, <b>Python</b>, and PHP</a>', 
 u'<a href="http://books.google.com/books?hl=en&amp;lr=&amp;id=El4TAgAAQBAJ&amp;oi=fnd&amp;pg=PT8&amp;dq=python+xpath&amp;ots=RrFv0f_Y6V&amp;sig=tSXzPJXbDi6KYnuuXEDnZCI7rDA"><b>Python </b>&amp; XML</a>', 
 u'<a href="https://code.grnet.gr/projects/ncclient/repository/revisions/efed7d4cd5ac60cbb7c1c38646a6d6dfb711acc9/raw/docs/proposal.pdf">A <b>Python </b>Module for NETCONF …
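One possible way to turn extracted `<a>` fragments like these into plain titles is sketched below in Python 3, using the standard library's `html.parser` rather than Beautiful Soup or scrapy itself. The helper names `TextExtractor` and `fragment_text` are hypothetical, and the sample fragments are copied from the output above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def fragment_text(fragment):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return ' '.join(''.join(parser.parts).split())


# Three of the fragments returned by the XPath query above.
fragments = [
    '<a href="http://i.iinfo.cz/r2/kd/xmlprague2007.pdf#page=53"><b>Python </b>and XML</a>',
    '<a href="http://www.worldcolleges.info/sites/default/files/py10.pdf">XML and <b>Python </b>Tutorial</a>',
    '<a href="ftp://ftp.sybex.com/4021/4021index.pdf">XML Processing with Perl, <b>Python</b>, and PHP</a>',
]

titles = [fragment_text(f) for f in fragments]
for title in titles:
    print(title)
```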

xpath scrapy

5
votes
1
answer
4067
views

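How to get the content of an HTML page in Python

I have downloaded a web page into an html file. I'd like to know the simplest way to get the content of that page. By content, I mean the strings that a browser would display.

To be clear:

Input: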

<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>

Output:

Page title This is paragraph one. This is paragraph two.

Putting it all together:

from BeautifulSoup import BeautifulSoup
import re

def removeHtmlTags(page):
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))
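For comparison, a minimal sketch of the same idea in Python 3 with only the standard library (an assumption of this example, since the code above targets Python 2 and Beautiful Soup 3). The names `VisibleText`, `visible_text`, and `PAGE` are hypothetical. Unlike the functions above, it also skips the contents of `script` and `style` elements and collapses runs of whitespace, so it reproduces the one-line output requested in the question.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring script and style contents."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def visible_text(page):
    """Return the page text with all whitespace runs collapsed to one space."""
    parser = VisibleText()
    parser.feed(page)
    parser.close()
    return ' '.join(''.join(parser.parts).split())


# The input from the question.
PAGE = '''<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>'''

print(visible_text(PAGE))
```

This only approximates what a browser displays: it does not account for CSS, hidden elements, or block-level line breaks.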

Related

html python parsing

4
votes
2
answers
10k
views

Tag statistics

html ×2

parsing ×1

python ×1

regex ×1

scrapy ×1

syntax ×1

xpath ×1