Extracting attributes from HTML using lxml

Gan*_*alf 2 html python lxml

I am using lxml to retrieve the attributes of tags from an HTML page. The HTML page is formatted like this:

<div class="my_div">
    <a href="/foobar">
        <img src="my_img.png">
    </a>
</div>

The Python script I use to retrieve the URL in the <a> tag and the src value of the <img> tag inside the same <div> is this:

from lxml import html 

...
tree = html.fromstring(page.text)
for element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = element.xpath('/@href')
    src = element.xpath('//img/@src')

Why am I not getting the strings?

小智 5

You are using lxml, so you are working with lxml objects: HtmlElement instances. HtmlElement derives from etree's element class (http://lxml.de/api/lxml.etree._Element-class.html), which has a get method that returns an attribute value. So the proper way for you is:

from lxml import html 

...
tree = html.fromstring(page.text)
for link_element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = link_element.get('href')
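    image_element = link_element.find('img')  # search under the <a>, not on the href string
    if image_element is not None:  # a childless element is falsy, so compare with None
        img_src = image_element.get('src')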
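To make the approach above easier to try out, here is a minimal self-contained sketch. The sample markup and the helper name extract_links are assumptions made for illustration (they stand in for the elided page download), and find('img') only looks at direct children of the <a>, which is enough for the structure shown in the question.

```python
from lxml import html

# Hypothetical sample markup mirroring the structure from the question.
SAMPLE = """
<div class="my_div">
    <a href="/foobar">
        <img src="my_img.png">
    </a>
</div>
"""


def extract_links(markup):
    """Return (href, src) pairs for each <a> inside a div with class my_div."""
    tree = html.fromstring(markup)
    pairs = []
    for link in tree.xpath('//div[contains(@class, "my_div")]//a'):
        href = link.get('href')   # None if the attribute is absent
        image = link.find('img')  # direct child only; None if there is none
        # Compare with None: an element without children evaluates as falsy.
        src = image.get('src') if image is not None else None
        pairs.append((href, src))
    return pairs


results = extract_links(SAMPLE)
print(results)
```

If the <img> can sit deeper inside the link, link.find('.//img') would search all descendants instead of only direct children.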


Oli*_* W. 0

If you change your code to:

from lxml import html 

...
tree = html.fromstring(page.text)
for element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = element.items()[0][1]  # value of the first attribute, which is "href" in this markup
    src = element.xpath('.//img/@src')[0]  # the leading "." keeps the search inside this <a>
    print(href, src)

you will get what you need.
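One likely reason the expressions in the question do not return the expected strings is that an XPath starting with / or // is absolute: it is evaluated from the document root even when called on an element. The sketch below contrasts absolute and relative paths; the two-block sample page is a hypothetical example, not the asker's real page.

```python
from lxml import html

# Hypothetical page with two blocks, to show the effect of absolute paths.
SAMPLE = """<html><body>
<div class="other"><a href="/elsewhere"><img src="other.png"></a></div>
<div class="my_div"><a href="/foobar"><img src="my_img.png"></a></div>
</body></html>"""

tree = html.fromstring(SAMPLE)
link = tree.xpath('//div[contains(@class, "my_div")]//a')[0]

# Absolute paths ignore the context element and start at the document root.
absolute_href = link.xpath('/@href')     # the root node has no attributes
absolute_src = link.xpath('//img/@src')  # every <img> in the whole document

# Relative paths start at the <a> element itself.
relative_href = link.xpath('@href')
relative_src = link.xpath('.//img/@src')

print(absolute_href, absolute_src)
print(relative_href, relative_src)
```

Note that xpath() returns a list even for a single match, so index it (or check that it is non-empty) before using the value.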

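The lxml documentation mentions some of this, but I feel it is lacking in places; you may want to use an interactive Python shell to investigate what tree.xpath() returns. Or you could look into another parser entirely, such as BeautifulSoup, which has very good examples and documentation. Just sharing...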