Hi all,

I'm having trouble parsing an RSS feed from Stack Exchange in Python. When I try to get the summary nodes, an empty list is returned.

I've been trying to solve this but can't figure it out.

Can anyone help? Thanks.
In [30]: import lxml.etree, urllib2
In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds'
In [32]: cooking_content = urllib2.urlopen(url_cooking)
In [33]: cooking_parsed = lxml.etree.parse(cooking_content)
In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')
In [35]: cooking_texts
Out[35]: []
Look at these two versions:
import lxml.html, lxml.etree
url_cooking = 'http://cooking.stackexchange.com/feeds'
#lxml.etree version
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
#lxml.html version
data = lxml.html.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
As you discovered, the second version returns no nodes, but the lxml.html version works fine. The etree version fails because it expects namespaces, and the html version works because it ignores them. As http://lxml.de/lxmlhtml.html puts it, the HTML parser "notably ignores namespaces and some other XMLisms".

Note that when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it is a feed element in the namespace http://www.w3.org/2005/Atom. Here is a corrected version of the etree code.
import lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

# Map a prefix to the Atom namespace so XPath can address the namespaced elements
ns = 'http://www.w3.org/2005/Atom'
ns_map = {'ns': ns}

data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map)
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
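If you'd rather not set up a prefix map at all, an alternative (a minimal sketch of my own, not from the original answer) is to use Clark notation, where the namespace URI is written directly into the tag name and matched with iter():

import lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'
ATOM = '{http://www.w3.org/2005/Atom}'  # Clark-notation prefix for the Atom namespace

data = lxml.etree.parse(url_cooking)

# iter() matches elements by their fully qualified tag name, so no namespace map is needed
summaries = list(data.getroot().iter(ATOM + 'summary'))
print('Found ' + str(len(summaries)) + ' summary nodes')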
The problem is namespaces.

Run this:
cooking_parsed.getroot().tag
and you will see that the element is named
{http://www.w3.org/2005/Atom}feed
The same applies if you navigate down to one of the feed entries.

This means the correct XPath in lxml is:
print cooking_parsed.xpath(
    "//a:feed/a:entry",
    namespaces={'a': 'http://www.w3.org/2005/Atom'})
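To go one step further and pull out the summary text itself, you can extend the same namespace map with a text() step (a small sketch building on the answer above; the text() step and the loop are my addition):

summaries = cooking_parsed.xpath(
    "//a:feed/a:entry/a:summary/text()",
    namespaces={'a': 'http://www.w3.org/2005/Atom'})

# Each item is the (HTML-escaped) summary string of one feed entry
for text in summaries[:3]:
    print text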