Get the text content of an HTML element using XPath?

Gen*_*han 19 html xml xpath html-parsing

Consider this HTML:

<div>
    <p>
    <span class="abc">Monitor</span> <b>$300</b>
    </p>
    <a href="/add">Add to cart</a>
</div>
<div>
    <p>
    <span class="abc">Keyboard</span> $20 
    </p>
    <a href="/add">Add to cart</a>
</div>

Using XPath, I want to extract "Monitor $300" and "Keyboard $20". I am using this XPath:

 //div[a[contains(., "Add to cart")]]/p/text()

But it selects <span class="abc">Monitor</span> <b>$300</b>. I don't want the tags. How do I get only the text?

Mar*_*ers 29

You want to select all descendant text nodes, not just the child text nodes:

//div[a[contains(., "Add to cart")]]/p//text()

Note the double slash between p and text() there.

This will probably also pick up a lot of whitespace between the tags, which you will need to clean up. An example using lxml:

>>> import lxml.etree as ET
>>> tree = ET.fromstring('''<div>
... <div>
...     <p>
...     <span class="abc">Monitor</span> <b>$300</b>
...     </p>
...     <a href="/add">Add to cart</a>
... </div>
... <div>
...     <p>
...     <span class="abc">Keyboard</span> $20 
...     </p>
...     <a href="/add">Add to cart</a>
... </div>
... </div>''')
>>> tree.xpath('//div[a[contains(., "Add to cart")]]/p//text()')
['\n    ', 'Monitor', ' ', '$300', '\n    ', '\n    ', 'Keyboard', ' $20 \n    ']
>>> res = _
>>> [txt for txt in (txt.strip() for txt in res) if txt]
['Monitor', '$300', 'Keyboard', '$20']
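
If the goal is one string per product, as in the question ("Monitor $300" and "Keyboard $20"), one possible alternative is to select the p elements first and then take the whitespace-normalized string value of each one. The following is only a sketch: it assumes lxml is installed and that the markup is wrapped in a single root div as in the example above, and the names doc, paragraphs and labels are made up for illustration.

```python
import lxml.etree as ET

# Same sample markup as in the answer, wrapped in one root element.
doc = ET.fromstring('''<div>
<div>
    <p>
    <span class="abc">Monitor</span> <b>$300</b>
    </p>
    <a href="/add">Add to cart</a>
</div>
<div>
    <p>
    <span class="abc">Keyboard</span> $20 
    </p>
    <a href="/add">Add to cart</a>
</div>
</div>''')

# Select the <p> elements themselves rather than their text nodes.
paragraphs = doc.xpath('//div[a[contains(., "Add to cart")]]/p')

# normalize-space(.) concatenates all descendant text of the context node,
# trims both ends and collapses internal runs of whitespace to one space.
labels = [p.xpath('normalize-space(.)') for p in paragraphs]
print(labels)
```

This keeps each product's name and price together and avoids the separate strip-and-filter step, at the cost of one extra XPath evaluation per paragraph.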

  • Wow! That double "//" saved my day. (4 upvotes)