lxml - difficulty parsing stackexchange rss feed

MrC*_*tro 5 python rss lxml xml-parsing

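Hi,

I'm having trouble parsing an RSS feed from stackexchange in Python. When I try to get the summary nodes, an empty list is returned.

I've been trying to solve this, but can't get my head around it.

Can anyone help? Thanks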

In [30]: import lxml.etree, urllib2

In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds' 

In [32]: cooking_content = urllib2.urlopen(url_cooking)

In [33]: cooking_parsed = lxml.etree.parse(cooking_content)

In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')

In [35]: cooking_texts
Out[35]: []

gfo*_*une 9

Take a look at these two versions:

import lxml.html, lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

#lxml.etree version
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

#lxml.html version
data = lxml.html.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

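As you discovered, the lxml.etree version returns no nodes, but the lxml.html version works fine. The etree version fails because it expects the elements to be qualified with their namespace, and the html version works because it ignores namespaces. As http://lxml.de/lxmlhtml.html puts it, "The HTML parser notably ignores namespaces and some other XMLisms."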

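Note that when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it is a feed element in the namespace http://www.w3.org/2005/Atom. Here is a corrected version of the etree code.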

import lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

ns = 'http://www.w3.org/2005/Atom'
ns_map = {'ns': ns}

data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map)
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
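To see the difference without hitting the network, here is a minimal sketch that runs both queries against a small inline document. SAMPLE_FEED is a made-up stand-in for the real feed, and the prefix name atom is an arbitrary choice; only the namespace URI has to match the document.

```python
import lxml.etree

# Hypothetical miniature Atom document standing in for the real feed.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry>
    <title>First question</title>
    <summary>First summary</summary>
  </entry>
  <entry>
    <title>Second question</title>
    <summary>Second summary</summary>
  </entry>
</feed>
"""

# The prefix is arbitrary; it only needs to map to the feed's namespace URI.
ATOM_NS = {'atom': 'http://www.w3.org/2005/Atom'}

root = lxml.etree.fromstring(SAMPLE_FEED)

# Unprefixed names match elements in no namespace, so nothing is found.
unprefixed = root.xpath('//entry/summary')

# Prefixed names resolve to the Atom namespace and match the summaries.
prefixed = root.xpath('//atom:entry/atom:summary', namespaces=ATOM_NS)
summaries = [node.text for node in prefixed]

print(len(unprefixed), summaries)
```

The default namespace declared with xmlns="..." on the root applies to every child element, which is why each step of the path needs the prefix.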


Mic*_*son 6

The problem is namespaces.

Run this:

 cooking_parsed.getroot().tag

and you'll see that the element is named

{http://www.w3.org/2005/Atom}feed

The same is true if you navigate to one of the feed entries.

This means the correct xpath in lxml is:

print(cooking_parsed.xpath(
  "//a:feed/a:entry",
  namespaces={'a': 'http://www.w3.org/2005/Atom'}))
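The {namespace}tag form shown in the element name can also be used directly to look up children, with no prefix map. A rough sketch follows, using the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; lxml.etree accepts the same notation in find() and findall(). SAMPLE_FEED is an invented stand-in for the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature Atom document standing in for the real feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><summary>First summary</summary></entry>
  <entry><summary>Second summary</summary></entry>
</feed>"""

# Namespace URI wrapped in braces, ready to prepend to a local tag name.
ATOM = '{http://www.w3.org/2005/Atom}'

root = ET.fromstring(SAMPLE_FEED)

# The tag carries the namespace, just like the getroot().tag output above.
print(root.tag)

# Look up children by their fully qualified names.
entries = root.findall(ATOM + 'entry')
summaries = [entry.find(ATOM + 'summary').text for entry in entries]
print(summaries)
```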