Dmy*_*hyi 9 html css python xpath beautifulsoup
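I want to extract some data from HTML and then highlight the extracted elements on the client side without modifying the source HTML. An XPath or CSS path looks like a great fit for this. Is it possible to extract the XPath or CSS path directly from BeautifulSoup?
Right now I mark the target element and then use the lxml lib to extract the XPath, which is very bad for performance. I know about BSXPath.py, but it does not work with BS4. Rewriting everything to use the native lxml lib is not acceptable because of the complexity.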
import io
import random

import bs4
from lxml import etree


def get_xpath(soup, element):
    # Mark the target element with a random attribute value.
    _id = str(random.getrandbits(32))
    for e in soup():
        if e is element:  # identity check; == compares tags by content
            e['data-xpath'] = _id
            break
    else:
        raise LookupError('Cannot find {} in {}'.format(element, soup))
    # Re-parse the serialized document with lxml and locate the marked element.
    content = str(soup)
    doc = etree.parse(io.StringIO(content), etree.HTMLParser())
    element = doc.xpath('//*[@data-xpath="{}"]'.format(_id))
    assert len(element) == 1
    element = element[0]
    xpath = doc.getpath(element)
    return xpath


soup = bs4.BeautifulSoup('<div id=i>hello, <b id=i test=t>world!</b></div>', 'html.parser')
xpath = get_xpath(soup, soup.div.b)
assert '/html/body/div/b' == xpath
It is actually quite easy to extract a simple CSS/XPath path. The result is the same as what the lxml lib gives you.
import bs4


def get_element(node):
    # CSS :nth-child counts element siblings only, so skip text nodes.
    # (For XPath we would have to count only siblings of the same type.)
    length = len([s for s in node.previous_siblings if isinstance(s, bs4.Tag)]) + 1
    if length > 1:
        return '%s:nth-child(%s)' % (node.name, length)
    else:
        return node.name


def get_css_path(node):
    path = [get_element(node)]
    for parent in node.parents:
        # Stop at <body>, or at the document root if the parser added no <body>.
        if parent.name == 'body' or parent.parent is None:
            break
        path.insert(0, get_element(parent))
    return ' > '.join(path)


soup = bs4.BeautifulSoup('<div></div><div><strong><i>bla</i></strong></div>', 'html.parser')
assert get_css_path(soup.i) == 'div:nth-child(2) > strong > i'
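The answer's code comment notes that for XPath only siblings of the same type should be counted. A minimal sketch of that variant follows; the names `xpath_step` and `get_xpath_from_soup` are hypothetical and not part of the original answer. It assumes the tree built by the chosen parser matches the DOM on the client (for example, `html.parser` does not add missing `<html>` or `<body>` tags, while lxml and browsers do), so the input here spells those tags out.

```python
import bs4


def xpath_step(node):
    """Return one XPath step such as 'div' or 'div[2]' for a bs4 Tag."""
    # XPath positions count only element siblings with the same tag name.
    before = [s for s in node.previous_siblings
              if isinstance(s, bs4.Tag) and s.name == node.name]
    after = [s for s in node.next_siblings
             if isinstance(s, bs4.Tag) and s.name == node.name]
    if before or after:
        # An index is needed only when the name is ambiguous among siblings.
        return '%s[%d]' % (node.name, len(before) + 1)
    return node.name


def get_xpath_from_soup(node):
    """Build an absolute XPath from the document root down to node."""
    steps = [xpath_step(node)]
    for parent in node.parents:
        if parent.parent is None:  # reached the BeautifulSoup object itself
            break
        steps.insert(0, xpath_step(parent))
    return '/' + '/'.join(steps)


soup = bs4.BeautifulSoup(
    '<html><body><div></div><div><strong><i>bla</i></strong></div></body></html>',
    'html.parser')
print(get_xpath_from_soup(soup.i))  # /html/body/div[2]/strong/i
```

Because the path is computed by walking `parents` and siblings inside BeautifulSoup, the document does not have to be serialized and re-parsed with lxml for every element.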