Using XPath to get the HTML tag that contains a partial string match

Question

Nav*_*ava 4 python xpath lxml html-parsing

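The HTML is arbitrary (I do not know its structure in advance), but it contains the string "PRICE". I need to partially match that string against the text of the HTML using XPath and, when it matches, get back the path of the HTML tag that contains it.

Note: I need to automate this logic for multiple sites, so I should use a generic rule (locate "PRICE", then get the parent tag).

Here is an example: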

html="""<div id = "price_id">
  <span id = "id1"></span>
  <div class="price_class">
   <bold>
   <strong>
   <label>PRICE:</label> 125 Rs.
   </bold>
   </strong>
   </br>
   </br>

</div>"""

I used lxml:

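from lxml.html import fromstring
from lxml.html.clean import Cleaner

# Clean the markup but keep the page structure tags.
cleaner = Cleaner(page_structure=False)
cl = cleaner.clean_html(html)
cleaned_html = fromstring(cl)

# This only visits direct children and compares the text exactly,
# so it does not find the nested 'PRICE:' label.
for element in cleaned_html:
    if element.text == 'PRICE':
        print("matched")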

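How can I write this as an XPath expression?

I only need to get the path of the div with the class, using an XPath expression.

A further problem: once I find the string "PRICE:", I should get its valid parent tag, i.e. the div with class name "price_class". Along the way I should skip or remove unwanted tags such as font, bold, italic...

Can you suggest how to get the valid parent tag of the located string?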

Answer 1

rec*_*dev 5

You can use the ancestor axis:

import lxml.html

html = ...
doc = lxml.html.fromstring(html)

# Match the label by partial text, then climb to the enclosing div.
for element in doc.xpath('//label[contains(text(), "PRICE:")]/ancestor::div[@class="price_class"]'):
    print('Found %s: %s' % (element.tag, element.text_content().strip()))

Output:

Found div: PRICE: 125 Rs.

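
For reference, here is a self-contained Python 3 sketch of the same approach. It assumes a simplified, well-nested variant of the sample HTML (the `<b>` tag stands in for `<bold>`), and it additionally calls lxml's `getroottree().getpath()` to print the XPath of the matched div, since the question asks for the tag path. The variable names are illustrative and not part of the original answer.

```python
import lxml.html

# Simplified, well-nested variant of the sample markup (an assumption for this sketch).
SAMPLE_HTML = """
<div id="price_id">
  <span id="id1"></span>
  <div class="price_class">
    <b><strong><label>PRICE:</label> 125 Rs.</strong></b>
  </div>
</div>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)
tree = doc.getroottree()

# Find the label by partial text match, then climb to the enclosing div.
matches = doc.xpath(
    '//label[contains(text(), "PRICE:")]/ancestor::div[@class="price_class"]'
)

for element in matches:
    # Collapse runs of whitespace so the text prints on one line.
    text = " ".join(element.text_content().split())
    # getpath() returns an absolute XPath that locates this element in the tree.
    path = tree.getpath(element)
    print("Found %s at %s: %s" % (element.tag, path, text))
```

The exact path string depends on how lxml wraps the fragment, so it is best treated as something to feed back into `tree.xpath()` rather than a fixed value.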

Edit: a more general solution for the revised question:

doc.xpath('//*[contains(text(), "PRICE:")]/\
          ancestor::*[not(self::strong|self::bold|self::italic)][1]')

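This searches for elements whose text contains "PRICE:" and then selects the nearest ancestor that is not strong, bold, or italic. You can add more tags to the exclusion list.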
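
When the same rule has to run across many sites, the exclusion list can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper `nearest_structural_ancestor` and an illustrative `SKIP_TAGS` tuple (neither comes from the original answer). The search text is passed as an XPath variable, which lxml supports through keyword arguments to `xpath()`.

```python
import lxml.html

# Illustrative list of presentational tags to skip; extend as needed.
SKIP_TAGS = ("strong", "b", "bold", "i", "italic", "em", "font")


def nearest_structural_ancestor(doc, needle, skip_tags=SKIP_TAGS):
    """For each element whose first text node contains `needle`,
    return its nearest ancestor that is not in `skip_tags`."""
    if skip_tags:
        # Build "self::strong or self::b or ..." from the tag list.
        skipped = " or ".join("self::%s" % tag for tag in skip_tags)
        ancestor_step = "ancestor::*[not(%s)][1]" % skipped
    else:
        # Nothing to skip: take the direct parent.
        ancestor_step = "ancestor::*[1]"
    expression = "//*[contains(text(), $needle)]/" + ancestor_step
    # Passing the text as an XPath variable avoids quoting problems.
    return doc.xpath(expression, needle=needle)


# Well-nested stand-in for the sample markup (an assumption for this sketch).
doc = lxml.html.fromstring(
    '<div class="price_class"><b><strong>'
    "<label>PRICE:</label> 125 Rs.</strong></b></div>"
)
result = nearest_structural_ancestor(doc, "PRICE:")
print([(el.tag, el.get("class")) for el in result])
```

Because `[1]` is applied on the reverse `ancestor` axis, it picks the closest qualifying ancestor of each match rather than the outermost one.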

Instead of an exclusion list, you can search for the first "good" ancestor (such as div, ul, etc.):

doc.xpath('//*[contains(text(), "PRICE:")]/ancestor::*[self::div|self::ul][1]')

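
One caveat with `contains(text(), ...)`: in XPath 1.0 a node-set passed to a string function is converted using only its first node, so the test inspects just the first text node of each element. The sketch below combines the whitelist idea with a text-node based match that does not have this limitation. The helper `nearest_container`, the `CONTAINER_TAGS` tuple, and the sample markup are all assumptions made for illustration.

```python
import lxml.html

# Illustrative whitelist of container tags; adjust per site.
CONTAINER_TAGS = ("div", "ul", "table")


def nearest_container(doc, needle, container_tags=CONTAINER_TAGS):
    """Return the nearest whitelisted ancestor of every text node
    that contains `needle`."""
    allowed = " or ".join("self::%s" % tag for tag in container_tags)
    # Start from text nodes so that text following an inline child is matched too.
    expression = "//text()[contains(., $needle)]/ancestor::*[%s][1]" % allowed
    return doc.xpath(expression, needle=needle)


# Hypothetical markup where "PRICE:" sits in the second text node of <li>.
doc = lxml.html.fromstring(
    "<ul><li>Special <b>offer</b> PRICE: 99 Rs.</li></ul>"
)

# The element-based test only looks at the first text node ("Special ").
naive = doc.xpath('//*[contains(text(), "PRICE:")]')
robust = nearest_container(doc, "PRICE:")
print(len(naive), [el.tag for el in robust])
```

Whether the stricter or the looser match is preferable depends on the pages being scraped; the text-node form tends to find more candidates, so the whitelist does more of the filtering.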

Archived:	14 years, 1 month ago
Views:	6895
Last activity:	7 years ago