sna*_*ies 2 python xml xpath lxml wildcard
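I'm trying to scrape the "Sector" and "Industry" fields from Yahoo Finance using lxml.
I noticed that the href URL is always http://biz.yahoo.com/ic/xyz.html, where xyz is a number.
Could you suggest a wildcard that matches one or more digits? I've tried several approaches based on Google and Stack searches, but nothing has worked.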
import lxml.html
url = 'http://finance.yahoo.com/q?s=AAPL'
root = lxml.html.parse(url).getroot()
# Pseudocode: the part in angle brackets is where the wildcard should go
for a in root.xpath('//a[@href="http://biz.yahoo.com/ic/<3-digit integer wildcard>.html"]'):
    print(a.text)
A pure XPath 1.0 solution (no extension functions):
//a[starts-with(@href, 'http://biz.yahoo.com/ic/')
  and
    substring(@href, string-length(@href)-4) = '.html'
  and
    string-length
      (substring-before
        (substring-after(@href, 'http://biz.yahoo.com/ic/'),
         '.')
      ) = 3
  and
    translate(substring-before
                (substring-after(@href, 'http://biz.yahoo.com/ic/'),
                 '.'),
              '0123456789',
              ''
             )
     = ''
   ]
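This XPath expression can be "read in English" like this:
Select any a element in the document whose href attribute's string value starts with the string 'http://biz.yahoo.com/ic/' and ends with the string '.html', where the substring between that prefix and the first following '.' has length 3 and consists only of digits.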
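To make the four conditions easier to follow, here is a minimal sketch that mirrors them with plain Python string operations. The function name href_matches and the constants PREFIX and SUFFIX are hypothetical, introduced only for illustration; this is not part of the XPath solution itself.

```python
PREFIX = "http://biz.yahoo.com/ic/"   # assumed fixed prefix, taken from the question
SUFFIX = ".html"
DIGITS = "0123456789"


def href_matches(href):
    """Return True if href passes the same four checks as the XPath predicate."""
    # starts-with(@href, PREFIX)
    if not href.startswith(PREFIX):
        return False
    # substring(@href, string-length(@href)-4) = '.html'  (the last 5 characters)
    if not href.endswith(SUFFIX):
        return False
    # substring-before(substring-after(@href, PREFIX), '.')
    middle = href[len(PREFIX):].partition(".")[0]
    # string-length(...) = 3
    if len(middle) != 3:
        return False
    # translate(..., '0123456789', '') = ''  (nothing is left once digits are removed)
    return all(ch in DIGITS for ch in middle)


if __name__ == "__main__":
    for candidate in ("http://biz.yahoo.com/ic/123.html",
                      "http://biz.yahoo.com/ic/1234.html",
                      "http://biz.yahoo.com/ic/x23.html"):
        print(candidate, href_matches(candidate))
```

The translate() trick is the XPath 1.0 way of saying "contains only digits": deleting every digit from the substring must leave the empty string.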
XSLT-based verification:
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "//a[starts-with(@href, 'http://biz.yahoo.com/ic/')
     and
       substring(@href, string-length(@href)-4) = '.html'
     and
       string-length
         (substring-before
           (substring-after(@href, 'http://biz.yahoo.com/ic/'),
            '.')
         ) = 3
     and
       translate(substring-before
                   (substring-after(@href, 'http://biz.yahoo.com/ic/'),
                    '.'),
                 '0123456789',
                 ''
                )
        = ''
      ]
  "/>
 </xsl:template>
</xsl:stylesheet>
When this transformation is applied to the following XML document:
<html>
  <body>
    <a href="http://biz.yahoo.com/ic/123.html">Link1</a>
    <a href="http://biz.yahoo.com/ic/1234.html">Incorrect</a>
    <a href="http://biz.yahoo.com/ic/x23.html">Incorrect</a>
    <a href="http://biz.yahoo.com/ic/621.html">Link2</a>
  </body>
</html>
the XPath expression is evaluated and the selected nodes are copied to the output:
<a href="http://biz.yahoo.com/ic/123.html">Link1</a>
<a href="http://biz.yahoo.com/ic/621.html">Link2</a>
As we can see, only the correct, wanted a elements are selected.
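The same check can be run from Python with lxml, which is what the question uses. The sketch below evaluates the expression against the sample document above. Because the question asked for one or more digits, it also tries a variant that relaxes "= 3" to ">= 1". The helper build_xpath and the names SAMPLE, PREFIX and MIDDLE are hypothetical, and the relaxed variant is an assumption about what the asker wanted rather than part of the answer.

```python
import lxml.html

# Sample markup copied from the verification document above.
SAMPLE = """<html><body>
<a href="http://biz.yahoo.com/ic/123.html">Link1</a>
<a href="http://biz.yahoo.com/ic/1234.html">Incorrect</a>
<a href="http://biz.yahoo.com/ic/x23.html">Incorrect</a>
<a href="http://biz.yahoo.com/ic/621.html">Link2</a>
</body></html>"""

PREFIX = "http://biz.yahoo.com/ic/"
# The part of @href between the prefix and the first following '.'
MIDDLE = "substring-before(substring-after(@href, '%s'), '.')" % PREFIX


def build_xpath(length_test):
    """Build the XPath 1.0 expression; length_test is e.g. '= 3' or '>= 1'."""
    return (
        "//a[starts-with(@href, '%s')"
        " and substring(@href, string-length(@href) - 4) = '.html'"
        " and string-length(%s) %s"
        " and translate(%s, '0123456789', '') = '']"
    ) % (PREFIX, MIDDLE, length_test, MIDDLE)


root = lxml.html.fromstring(SAMPLE)

# Exactly three digits, as in the expression above.
exactly_three = [a.get("href") for a in root.xpath(build_xpath("= 3"))]

# One or more digits (assumed reading of the original question).
one_or_more = [a.get("href") for a in root.xpath(build_xpath(">= 1"))]

print(exactly_three)
print(one_or_more)
```

With the relaxed test, the 1234.html link is selected as well, while x23.html is still rejected by the translate() check. Both variants only inspect the text up to the first '.', so an href such as .../ic/123.x.html would also pass.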