Parsing HTML with lxml - how do I specify a 1- to 3-digit wildcard to make my code less brittle?

sna*_*ies 2 python xml xpath lxml wildcard

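I am trying to use lxml to scrape the "Sector" and "Industry" fields from Yahoo Finance.

I noticed that the href URL is always http://biz.yahoo.com/ic/xyz.html, where xyz is a number.

Can you suggest a wildcard that matches one or more digits? I have tried several approaches based on Google and Stack Overflow searches, but nothing has worked.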

import lxml.html
url = 'http://finance.yahoo.com/q?s=AAPL'
root = lxml.html.parse(url).getroot()
# WILDCARD is a placeholder for the 3-digit integer wildcard I am looking for
for a in root.xpath('//a[@href="http://biz.yahoo.com/ic/' + WILDCARD + '.html"]'):
    print a.text

Dim*_*hev 5

A pure XPath 1.0 solution (no extension functions):

//a[starts-with(@href, 'http://biz.yahoo.com/ic/')
  and
    substring(@href, string-length(@href)-4) = '.html'
  and
    string-length
      (substring-before
          (substring-after(@href, 'http://biz.yahoo.com/ic/'), 
           '.')
      ) = 3
  and
    translate(substring-before
               (substring-after(@href, 'http://biz.yahoo.com/ic/'), 
                '.'),
              '0123456789',
              ''
              )
     = ''
   ]

This XPath expression can be "read in English" like this:

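Select any a in the document whose href attribute has a string value that starts with the string 'http://biz.yahoo.com/ic/' and ends with the string '.html', where the substring between the starting and ending substrings has a length of 3 and consists only of digits.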
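Since the question uses lxml, here is a minimal sketch of how the same expression might be evaluated from Python. The inline HTML, the `PREFIX` constant, and the helper name `matching_link_texts` are made up for illustration; in the real script the root would come from `lxml.html.parse(url).getroot()` as in the question.

```python
import lxml.html

# Assumption: the prefix is the one quoted in the question.
PREFIX = "http://biz.yahoo.com/ic/"

# The XPath 1.0 predicate from above, assembled as one Python string.
XPATH = (
    "//a[starts-with(@href, '{p}')"
    " and substring(@href, string-length(@href) - 4) = '.html'"
    " and string-length(substring-before(substring-after(@href, '{p}'), '.')) = 3"
    " and translate(substring-before(substring-after(@href, '{p}'), '.'),"
    " '0123456789', '') = '']"
).format(p=PREFIX)

# Hypothetical page fragment, used only to exercise the expression.
SAMPLE = """
<html><body>
  <a href="http://biz.yahoo.com/ic/811.html">Sector link</a>
  <a href="http://biz.yahoo.com/ic/ab1.html">Not a number</a>
  <a href="http://example.com/ic/811.html">Other host</a>
</body></html>
"""

def matching_link_texts(html_text):
    """Return the text of every <a> whose href matches the predicate."""
    root = lxml.html.fromstring(html_text)
    return [a.text for a in root.xpath(XPATH)]

if __name__ == "__main__":
    print(matching_link_texts(SAMPLE))
```

Building the expression once and reusing it keeps the scraping loop itself down to a single `root.xpath(...)` call.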

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "//a[starts-with(@href, 'http://biz.yahoo.com/ic/')
      and
        substring(@href, string-length(@href)-4) = '.html'
      and
        string-length
          (substring-before
              (substring-after(@href, 'http://biz.yahoo.com/ic/'),
               '.')
          ) = 3
      and
        translate(substring-before
                   (substring-after(@href, 'http://biz.yahoo.com/ic/'),
                    '.'),
                  '0123456789',
                  ''
                  )
         = ''
       ]
   "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<html>
  <body>
    <a href="http://biz.yahoo.com/ic/123.html">Link1</a>
    <a href="http://biz.yahoo.com/ic/1234.html">Incorrect</a>
    <a href="http://biz.yahoo.com/ic/x23.html">Incorrect</a>
    <a href="http://biz.yahoo.com/ic/621.html">Link2</a>
  </body>
</html>

the XPath expression is evaluated and the selected nodes are copied to the output:

<a href="http://biz.yahoo.com/ic/123.html">Link1</a>
<a href="http://biz.yahoo.com/ic/621.html">Link2</a>

As we can see, only the correct, wanted a elements are selected.
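The expression above fixes the length at exactly 3 digits, while the question title asks for 1 to 3. One possible relaxation, sketched below under that assumption, replaces the `= 3` test with a range check. The second variant relies on lxml's support for the EXSLT regular-expressions namespace, so it is no longer pure XPath 1.0. The sample HTML and the helper names are hypothetical.

```python
import lxml.html

PREFIX = "http://biz.yahoo.com/ic/"

# The part of href between the prefix and the first '.'.
MIDDLE = "substring-before(substring-after(@href, '{p}'), '.')".format(p=PREFIX)

# Pure XPath 1.0: same predicate, but the length may be 1, 2 or 3.
XPATH_1_TO_3 = (
    "//a[starts-with(@href, '{p}')"
    " and substring(@href, string-length(@href) - 4) = '.html'"
    " and string-length({m}) >= 1"
    " and string-length({m}) <= 3"
    " and translate({m}, '0123456789', '') = '']"
).format(p=PREFIX, m=MIDDLE)

# Alternative using the EXSLT regular-expressions extension in lxml.
RE_NS = {"re": "http://exslt.org/regular-expressions"}
XPATH_REGEX = r"//a[re:test(@href, '^http://biz\.yahoo\.com/ic/[0-9]{1,3}\.html$')]"

# Hypothetical test document covering 1-, 2- and 3-digit links plus rejects.
SAMPLE = """
<html><body>
  <a href="http://biz.yahoo.com/ic/7.html">One</a>
  <a href="http://biz.yahoo.com/ic/42.html">Two</a>
  <a href="http://biz.yahoo.com/ic/123.html">Three</a>
  <a href="http://biz.yahoo.com/ic/1234.html">TooLong</a>
  <a href="http://biz.yahoo.com/ic/x23.html">NotDigits</a>
  <a href="http://biz.yahoo.com/ic/.html">Empty</a>
</body></html>
"""

def texts_pure_xpath(html_text):
    """Link texts selected by the relaxed pure XPath 1.0 expression."""
    root = lxml.html.fromstring(html_text)
    return [a.text for a in root.xpath(XPATH_1_TO_3)]

def texts_exslt_regex(html_text):
    """Link texts selected by the EXSLT regex variant."""
    root = lxml.html.fromstring(html_text)
    return [a.text for a in root.xpath(XPATH_REGEX, namespaces=RE_NS)]

if __name__ == "__main__":
    print(texts_pure_xpath(SAMPLE))
    print(texts_exslt_regex(SAMPLE))
```

The pure XPath form stays portable to any XPath 1.0 engine, whereas the regex form is shorter but depends on the extension being available.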