jQuery-like HTML parsing in Python?

Roy*_*ang 61 python jquery

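Is there a Python library that lets me parse an HTML document the way jQuery does?

That is, I'd like to use CSS selector syntax to grab an arbitrary set of nodes from the document and read their contents, attributes, and so on.

The only Python HTML parsing library I've used before is BeautifulSoup, and although it's good, I still think my parsing would go faster if I had jQuery syntax. :D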

sys*_*out 59

If you're fluent with BeautifulSoup, you can just add soupselect to your libs.
Soupselect is a CSS selector extension for BeautifulSoup.

Usage:

>>> from BeautifulSoup import BeautifulSoup as Soup
>>> from soupselect import select
>>> import urllib
>>> soup = Soup(urllib.urlopen('http://slashdot.org/'))
>>> select(soup, 'div.title h3')
[<h3><span><a href='//science.slashdot.org/'>Science</a>:</span></h3>,
 <h3><a href='//slashdot.org/articles/07/02/28/0120220.shtml'>Star Trek</h3>,
..]

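  • If you have trouble installing soupselect, try the pip-compatible version at https://github.com/syabro/soupselect: `sudo pip install https://github.com/syabro/soupselect/archive/master.zip` (10 upvotes)
  • It's now `bs4`, from Beautiful Soup 4 (6 upvotes)
  • It's worth mentioning that Beautiful Soup 4 has integrated the soupselect project and now has built-in support for CSS selectors. See the [release notes](http://www.crummy.com/2012/03/14/0). (4 upvotes)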
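
For anyone landing here on Beautiful Soup 4, the built-in selector support mentioned in the comments can be sketched roughly as follows. The markup and the `example.org` URLs are made up for illustration; only `select()`, `get_text()` and attribute indexing come from bs4 itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for a downloaded page.
HTML = """
<div class="title"><h3><a href="//science.example.org/">Science</a></h3></div>
<div class="title"><h3><a href="//books.example.org/">Books</a></h3></div>
<div class="other"><h3><a href="//ignored.example.org/">Ignored</a></h3></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select() takes a CSS selector and returns a list of matching Tag objects.
links = soup.select("div.title h3 a")

# Read the text content and an attribute from each match.
titles = [a.get_text() for a in links]
hrefs = [a["href"] for a in links]

print(titles)
print(hrefs)
```

No separate soupselect import is needed in this case; the selector is passed straight to the soup object.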

Luk*_*ley 43

Consider PyQuery:

http://packages.python.org/pyquery/

>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> import urllib
>>> d = pq("<html></html>")
>>> d = pq(etree.fromstring("<html></html>"))
>>> d = pq(url='http://google.com/')
>>> d = pq(url='http://google.com/', opener=lambda url: urllib.urlopen(url).read())
>>> d = pq(filename=path_to_html_file)
>>> d("#hello")
[<p#hello.hello>]
>>> p = d("#hello")
>>> p.html()
'Hello world !'
>>> p.html("you know <a href='http://python.org/'>Python</a> rocks")
[<p#hello.hello>]
>>> p.html()
u'you know <a href="http://python.org/">Python</a> rocks'
>>> p.text()
'you know Python rocks'


eus*_*iro 7

BeautifulSoup now supports CSS selectors

import requests
from bs4 import BeautifulSoup as Soup
html = requests.get('/sf/ask/213590681/').content
soup = Soup(html, 'html.parser')  # name the parser explicitly to avoid a bs4 warning

The title of this question

soup.select('h1.grid--cell :first-child')[0].text

The question's vote count

# first item 
soup.select_one('[itemprop="upvoteCount"]').text

The HTML page is fetched with Python Requests
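
One caveat with the two snippets above: `select(...)[0]` and `select_one(...)` behave differently when the selector matches nothing. The sketch below shows this on made-up markup that only imitates a question page; the real page structure may differ, and `h2.subtitle` is a deliberately non-matching selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a question page.
HTML = """
<h1 class="title"><a href="/q/1">jQuery-like HTML parsing in Python?</a></h1>
<div itemprop="upvoteCount">61</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select() always returns a list, so an unmatched selector gives an
# empty list and indexing it with [0] would raise IndexError.
missing_list = soup.select("h2.subtitle")

# select_one() returns the first match, or None when nothing matches.
votes_tag = soup.select_one('[itemprop="upvoteCount"]')
missing_tag = soup.select_one("h2.subtitle")

# Guard against None before reading .text.
votes = int(votes_tag.text) if votes_tag is not None else None
subtitle = missing_tag.text if missing_tag is not None else None

print(votes, subtitle, len(missing_list))
```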