Empty list returned from ElementTree findall

lil*_*oka 14 python xml parsing elementtree wikimedia-dumps

I'm new to XML parsing and to Python, so please bear with me. I'm using lxml to parse a wiki dump, but all I want is each page, its title, and its text.

This is what I have right now:

from xml.etree import ElementTree as etree

def parser(file_name):
    document = etree.parse(file_name)
    titles = document.findall('.//title')
    print titles

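Currently, titles isn't returning anything. I've looked at earlier answers such as this one: ElementTree findall() returning empty list, and at the lxml documentation, but most of it seems tailored to parsing HTML.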

Here is a portion of my XML:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/"     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.7/ http://www.mediawiki.org/xml/export-0.7.xsd" version="0.7" xml:lang="en">
  <siteinfo>
  <sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.20wmf9</generator>
<case>first-letter</case>
<namespaces>
  <namespace key="-2" case="first-letter">Media</namespace>
  <namespace key="-1" case="first-letter">Special</namespace>
  <namespace key="0" case="first-letter" />
  <namespace key="1" case="first-letter">Talk</namespace>
  <namespace key="2" case="first-letter">User</namespace>
  <namespace key="3" case="first-letter">User talk</namespace>
  <namespace key="4" case="first-letter">Wikipedia</namespace>
  <namespace key="5" case="first-letter">Wikipedia talk</namespace>
  <namespace key="6" case="first-letter">File</namespace>
  <namespace key="7" case="first-letter">File talk</namespace>
  <namespace key="8" case="first-letter">MediaWiki</namespace>
  <namespace key="9" case="first-letter">MediaWiki talk</namespace>
  <namespace key="10" case="first-letter">Template</namespace>
  <namespace key="11" case="first-letter">Template talk</namespace>
  <namespace key="12" case="first-letter">Help</namespace>
  <namespace key="13" case="first-letter">Help talk</namespace>
  <namespace key="14" case="first-letter">Category</namespace>
  <namespace key="15" case="first-letter">Category talk</namespace>
  <namespace key="100" case="first-letter">Portal</namespace>
  <namespace key="101" case="first-letter">Portal talk</namespace>
  <namespace key="108" case="first-letter">Book</namespace>
  <namespace key="109" case="first-letter">Book talk</namespace>
</namespaces>
  </siteinfo>
  <page>
    <title>Aratrum</title>
    <ns>0</ns>
    <id>65741</id>
    <revision>
  <id>349931990</id>
  <parentid>225434394</parentid>
  <timestamp>2010-03-15T02:55:02Z</timestamp>
  <contributor>
    <ip>143.105.193.119</ip>
  </contributor>
  <comment>/* Sources */</comment>
  <sha1>2zkdnl9nsd1fbopv0fpwu2j5gdf0haw</sha1>
  <text xml:space="preserve" bytes="1436">'''Aratrum''' is the Latin word for  [[plough]], and &quot;arotron&quot; (???????) is the [[Greek language|Greek]] word. The   [[Ancient Greece|Greeks]] appear to have had diverse kinds of plough from the earliest  historical records. [[Hesiod]] advised the farmer to have always two ploughs, so that if  one broke the other might be ready for use. These ploughs should be of two kinds, the one  called &quot;autoguos&quot; (????????, &quot;self-limbed&quot;), in which the plough-tail  was of the same piece of timber as the share-beam and the pole; and the other called  &quot;pekton&quot; (??????, &quot;fixed&quot;), because in it, three parts, which were of  three kinds of timber, were adjusted to one another, and fastened together by nails.

The ''autoguos'' plough was made from a [[sapling]] with two branches growing from its   trunk in opposite directions. In ploughing, the trunk served as the pole, one of the two     branches stood upwards and became the tail, and the other penetrated the ground and,    sometimes shod with bronze or iron, acted as the [[ploughshare]]. 

==Sources==
Based on an article from ''A Dictionary of Greek and Roman Antiquities,'' John Murray,     London, 1875.
???????

==External links==
*[http://penelope.uchicago.edu/Thayer/E/Roman/Texts/secondary/SMIGRA*/Aratrum.html Smith's     Dictionary article], with diagrams, further details, sources.
[[Category:Agricultural machinery]]
[[Category:Ancient Greece]]
[[Category:Animal equipment]]</text>
</revision>
</page>

I also tried iterparse, printing the tag of each element it finds:

for e in etree.iterparse(file_name):
    print e.tag

But it complains that there is no tag attribute.

Edit: screenshot

mzj*_*zjn 26

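The problem is that you are not taking XML namespaces into account. The XML document (and every element in it) is in the http://www.mediawiki.org/xml/export-0.7/ namespace. To make it work, you need to change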

titles = document.findall('.//title')

to

titles = document.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')
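To see why the unqualified path finds nothing, it can help to print a tag: ElementTree stores each one as "{namespace-uri}localname". The following is a minimal sketch written for Python 3, using a cut-down inline sample rather than the real dump; the names MW_NS and SAMPLE are made up for illustration.

```python
from xml.etree import ElementTree as etree

MW_NS = 'http://www.mediawiki.org/xml/export-0.7/'

# A cut-down stand-in for the dump (hypothetical sample, not the real file).
SAMPLE = (
    '<mediawiki xmlns="' + MW_NS + '">'
    '<page><title>Aratrum</title></page>'
    '</mediawiki>'
)

root = etree.fromstring(SAMPLE)

# Every tag carries the namespace URI in curly braces in front of the name.
first_page = root[0]
print(first_page.tag)  # {http://www.mediawiki.org/xml/export-0.7/}page

# The bare name matches nothing; the qualified name matches the title.
unqualified = root.findall('.//title')
qualified = root.findall('.//{%s}title' % MW_NS)
print(len(unqualified), len(qualified))  # 0 1
```
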

The namespace can also be supplied through the namespaces parameter:

NSMAP = {'mw':'http://www.mediawiki.org/xml/export-0.7/'}
titles = document.findall('.//mw:title', namespaces=NSMAP)

This works in Python 2.7, but it is not explained in the Python 2.7 documentation (the Python 3.3 documentation is better).

See also http://effbot.org/zone/element-namespaces.htm and the answers to this question: Parsing XML with namespace in Python via 'ElementTree'.
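Since the goal in the question is each page with its title and text, the prefix-map approach might be extended along these lines. This is only a sketch written for Python 3: the pages() helper and the inline SAMPLE are hypothetical, and it assumes each page has the text under a revision element, as in the excerpt from the question.

```python
from xml.etree import ElementTree as etree

NSMAP = {'mw': 'http://www.mediawiki.org/xml/export-0.7/'}

# Hypothetical, shortened stand-in for the real dump file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page>
    <title>Aratrum</title>
    <revision>
      <text>'''Aratrum''' is the Latin word for [[plough]].</text>
    </revision>
  </page>
</mediawiki>"""


def pages(root):
    """Return a list of (title, text) pairs, one per <page> child of root."""
    result = []
    for page in root.findall('mw:page', namespaces=NSMAP):
        # findtext returns the text of the first match, or None if absent.
        title = page.findtext('mw:title', namespaces=NSMAP)
        text = page.findtext('mw:revision/mw:text', namespaces=NSMAP)
        result.append((title, text))
    return result


root = etree.fromstring(SAMPLE)
for title, text in pages(root):
    print(title, '->', text)
```

For a file on disk, etree.parse(file_name).getroot() could be passed to pages() in place of the root built from SAMPLE.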


The trouble with iterparse() is that this function yields (event, element) tuples, not bare elements. To get the tag name, change

for e in etree.iterparse(file_name):
    print e.tag

to this:

for e in etree.iterparse(file_name):
    print e[1].tag
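Unpacking the tuple in the loop header reads a little more clearly than indexing, and iterparse also makes it possible to process a large dump without holding the whole tree in memory. The sketch below is written for Python 3 and is one possible way to do that; iter_titles() and the inline SAMPLE are hypothetical names, and the tags are namespace-qualified just as they are for findall().

```python
import io
from xml.etree import ElementTree as etree

# Hypothetical, shortened stand-in for the real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page><title>Aratrum</title></page>
  <page><title>Plough</title></page>
</mediawiki>"""

TITLE_TAG = '{http://www.mediawiki.org/xml/export-0.7/}title'
PAGE_TAG = '{http://www.mediawiki.org/xml/export-0.7/}page'


def iter_titles(source):
    """Yield page titles one at a time from a file name or file object."""
    # By default iterparse reports only "end" events, so each element
    # is complete (text included) by the time it is yielded here.
    for event, elem in etree.iterparse(source):
        if elem.tag == TITLE_TAG:
            yield elem.text
        elif elem.tag == PAGE_TAG:
            # The page is finished; drop its children to limit memory use.
            elem.clear()


titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)  # ['Aratrum', 'Plough']
```

With a real dump, the file name can be passed to iter_titles() directly instead of the BytesIO wrapper.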

  • This is exactly what I was looking for! Thank you! (2 upvotes)