I am using lxml to retrieve tag attributes from an HTML page. The HTML page is formatted like this:
<div class="my_div">
<a href="/foobar">
<img src="my_img.png">
</a>
</div>
The Python script I use to retrieve the URL inside the <a> tag and the src value of the <img> tag inside the same <div> is this:
from lxml import html
...
tree = html.fromstring(page.text)
for element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = element.xpath('/@href')
    src = element.xpath('//img/@src')
Why am I not getting the strings?
小智 5
You are using lxml, so you are working with lxml objects, namely HtmlElement instances. HtmlElement inherits from etree.Element (http://lxml.de/api/lxml.etree._Element-class.html), which has a get method that returns the value of an attribute. So the proper way to do this is:
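from lxml import html
...
tree = html.fromstring(page.text)
for link_element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = link_element.get('href')
    # find() must be called on the element, not on the href string
    image_element = link_element.find('img')
    # compare with None: an element without children evaluates as false
    if image_element is not None:
        img_src = image_element.get('src')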
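To show why the original expressions return no usable values, here is a minimal sketch comparing absolute and relative XPath expressions evaluated from the same <a> element. The page_source string, including the extra banner.png image, is made up for illustration and is not part of the original question.

```python
from lxml import html

# Hypothetical page: the question's markup plus one unrelated image,
# added only to make the difference between the expressions visible.
page_source = """
<html><body>
<img src="banner.png">
<div class="my_div">
  <a href="/foobar">
    <img src="my_img.png">
  </a>
</div>
</body></html>
"""

tree = html.fromstring(page_source)

for link_element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    # A leading "/" starts at the document root, not at link_element,
    # and the root node has no attributes.
    absolute_href = link_element.xpath('/@href')
    # A leading "//" searches the whole document, not just this <a>.
    absolute_srcs = link_element.xpath('//img/@src')
    # A leading "." makes the expression relative to link_element.
    relative_href = link_element.xpath('./@href')
    relative_srcs = link_element.xpath('.//img/@src')
    print(absolute_href, absolute_srcs)
    print(relative_href, relative_srcs)
```

Each call still returns a list, so a single value has to be taken with an index such as relative_href[0], ideally after checking that the list is not empty.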
If you change your code to:
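from lxml import html
...
tree = html.fromstring(page.text)
for element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    # gives you the value corresponding to the key "href"
    # (this relies on href being the first attribute of the <a> tag)
    href = element.items()[0][1]
    # the leading "." keeps the search inside the current <a> element
    src = element.xpath('.//img/@src')[0]
    print(href, src)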
you will get what you need.
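The lxml documentation mentions some of this, but I feel it leaves some things out. You may want to use an interactive Python shell to explore what tree.xpath() returns. Or you could look into a different parser altogether, such as BeautifulSoup, which has very good examples and documentation. Just sharing...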
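As a rough idea of what such an exploration might look like, the sketch below inspects the results of a few xpath() calls on the markup from the question. The variable names are illustrative only.

```python
from lxml import html

# The markup from the question, parsed as a fragment.
tree = html.fromstring(
    '<div class="my_div"><a href="/foobar"><img src="my_img.png"></a></div>'
)

links = tree.xpath('//a')            # element query: a list of elements
hrefs = tree.xpath('//a/@href')      # attribute query: a list of string-like values
count = tree.xpath('count(//img)')   # XPath function: a number, not a list

print(type(links), type(links[0]))
print(type(hrefs), type(hrefs[0]))
print(type(count), count)

# Node and attribute queries return lists, so index into the result
# (after checking it is not empty) to get a single string.
first_href = hrefs[0] if hrefs else None
print(first_href)
```

Typing the same lines one at a time in an interactive session makes it easy to see that the value you want is wrapped in a list.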