How to find element text in an XHTML document using lxml

mch*_*ctt 5 python xpath lxml

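I've been banging my head against this for ages; I must be doing something stupid.

I'm trying to retrieve all of the languages that Wikipedia supports and write them to a text file by traversing the tables on List_of_Wikipedias.

Here is my Python code so far, which just tries to retrieve one of the tables: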

import httplib
from lxml import etree

def main():
    conn = httplib.HTTPConnection("meta.wikimedia.org")
    conn.request("GET","/wiki/List_of_Wikipedias")
    res = conn.getresponse()
    root = etree.fromstring(res.read())
    table = root.xpath('//table')
    print table

main()

On my machine this just prints an empty list. To speed things up, I cached the page locally and used:

wikipage = open("wikipage.html")
root = lxml.parse(wikipage)

But that made no difference (apart from the obvious speed-up). I also tried

lxml.find('table')

and:

for element in root.iter():
    print("%s - %s" % (element.tag, element.text))

The loop prints every element successfully, so I know the tree is being built.

What am I doing wrong?

Any help would be appreciated. Thanks.

Dim*_*hev 3

I am trying to retrieve all of the possible Wikipedia supported languages and output them to a text file by traversing the tables on List_of_Wikipedias

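Your problem is that the element names in the document are in a default namespace. How to write XPath expressions that involve such element names is the most frequently asked question about XPath, and there are many good answers under the SO xpath tag. Just search for them.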
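To see the problem concretely, here is a minimal sketch on a tiny stand-in document (the markup and the `XHTML_NS` name are made up for illustration, not taken from the real page). In lxml, elements in a default namespace get tags of the form `{namespace}name`, so an unprefixed `//table` matches nothing, while a prefixed name bound to that namespace does:

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical miniature page; note the xmlns declaration on <html>.
doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><table><tr><td>English</td></tr></table></body>"
    "</html>"
)
root = etree.fromstring(doc)

# The tag carries the namespace in Clark notation.
print(root.tag)

# An unprefixed name only matches elements in no namespace.
no_prefix = root.xpath("//table")

# Binding a prefix (any name works, "x" here) to the namespace matches.
with_prefix = root.xpath("//x:table", namespaces={"x": XHTML_NS})

print(len(no_prefix), len(with_prefix))
```

This mirrors the symptom in the question: the tree is built and iterating over it works, yet `//table` returns an empty list.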

Here is a complete solution:

Use:

(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()

where you have registered the XHTML namespace ("http://www.w3.org/1999/xhtml") and bound it to the prefix "x".
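In lxml, one way to supply that binding is the `namespaces` argument of `xpath()`. The following is a sketch on a small hypothetical table, assuming (as the expression does) that the language name sits in the second cell of each data row of the first table:

```python
from lxml import etree

# Prefix-to-namespace mapping; the prefix "x" is arbitrary.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample; the real page is much larger.
sample = """
<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td><a href="#en">English</a></td></tr>
<tr><td>2</td><td><a href="#de">German</a></td></tr>
</table>
<table><tr><td>9</td><td>not selected</td></tr></table>
</body></html>
""".strip()

root = etree.fromstring(sample)

# First table only, skipping header rows, second cell, all text nodes.
languages = root.xpath(
    "(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()", namespaces=NS
)
print(languages)
```

The results are string-like objects, so they can be written to a text file line by line.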

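I evaluated this XPath expression against the document obtained from http://s23.org/wikistats/wikipedias_html.

I had to add the following at the start of the document, because I was working locally and did not have the XHTML DTD; you may not need it: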

<!DOCTYPE html [
<!ENTITY uarr "&#8593;">
<!ENTITY darr "&#8595;">
<!ENTITY ccedil "&#231;">
<!ENTITY oslash "&#248;">
<!ENTITY aacute "&#225;">
<!ENTITY aring "&#229;">
<!ENTITY agrave "&#224;">
<!ENTITY egrave "&#232;">
<!ENTITY ograve "&#242;">
<!ENTITY ocirc "&#244;">
]>
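The reason such declarations matter is that an XML parser rejects a named entity like `&aring;` unless it has been declared. A small sketch with lxml, using a made-up one-paragraph document and only one of the entities listed above:

```python
from lxml import etree

NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical fragment that uses a named entity.
body = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Norwegian (Bokm&aring;l)</p></body></html>"
)

# Internal DTD subset declaring just the entity this fragment needs.
doctype = '<!DOCTYPE html [\n<!ENTITY aring "&#229;">\n]>\n'

try:
    etree.fromstring(body)
    parsed_without_doctype = True
except etree.XMLSyntaxError:
    # The default (non-recovering) parser stops at the undeclared entity.
    parsed_without_doctype = False

# With the declaration in place the document parses and the entity
# is replaced by the character it stands for.
root = etree.fromstring(doctype + body)
text = root.xpath("string(//x:p)", namespaces=NS)

print(parsed_without_doctype, text)
```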

The result of applying the XPath expression above to this document is:

English
German
French
Polish
Italian
Japanese
Spanish
Portuguese
Dutch
Russian
Swedish
Chinese
Catalan
Norwegian (Bokmål)
Finnish
Ukrainian
Czech
Hungarian
Romanian
Korean
Turkish
Vietnamese
Indonesian
Danish
Arabic
Esperanto
Serbian
Lithuanian
Slovak
Volapük
Persian
Hebrew
Bulgarian
Slovenian
Malay
Waray-Waray
Croatian
Estonian
Newar / Nepal Bhasa
Simple English
Hindi
Galician
Thai
Basque
Norwegian (Nynorsk)
Aromanian
Greek
Haitian
Azerbaijani
Tagalog
Latin
Telugu
Georgian
Macedonian
Cebuano
Serbo-Croatian
Breton
Piedmontese
Marathi
Latvian
Luxembourgish
Javanese
Belarusian (Taraškievica)
Welsh
Icelandic
Bosnian
Albanian
Tamil
Belarusian
Bishnupriya Manipuri
Aragonese
Occitan
Bengali
Swahili
Ido
Lombard
West Frisian
Gujarati
Afrikaans
Low Saxon
Malayalam
Quechua
Sicilian
Urdu
Kurdish
Cantonese
Sundanese
Asturian
Neapolitan
Samogitian
Armenian
Yoruba
Irish
Chuvash
Walloon
Nepali
Ripuarian
Western Panjabi
Kannada
Tajik
Tarantino
Venetian
Yiddish
Scottish Gaelic
Tatar
Min Nan
Ossetian
Uzbek
Alemannic
Kapampangan
Sakha
Egyptian Arabic
Kazakh
Maori
Limburgian
Amharic
Nahuatl
Upper Sorbian
Gilaki
Corsican
Gan
Mongolian
Scots
Interlingua
Central_Bicolano
Burmese
Faroese
Võro
Dutch Low Saxon
Sinhalese
Turkmen
West Flemish
Sanskrit
Bavarian
Malagasy
Manx
Ilokano
Divehi
Norman
Pangasinan
Banyumasan
Sorani
Romansh
Northern Sami
Zazaki
Mazandarani
Wu
Friulian
Uyghur
Ligurian
Maltese
Bihari
Novial
Tibetan
Anglo-Saxon
Kashubian
Sardinian
Classical Chinese
Fiji Hindi
Khmer
Ladino
Zamboanga Chavacano
Pali
Franco-Provençal/Arpitan
Pashto
Hakka
Cornish
Punjabi
Navajo
Silesian
Kalmyk
Pennsylvania German
Hawaiian
Saterland Frisian
Interlingue
Somali
Komi
Karachay-Balkar
Crimean Tatar
Tongan
Acehnese
Meadow Mari
Picard
Erzya
Lingala
Kinyarwanda
Extremaduran
Guarani
Kirghiz
Emilian-Romagnol
Assyrian Neo-Aramaic
Papiamentu
Aymara
Chechen
Lojban
Wolof
Banjar
Bashkir
North Frisian
Greenlandic
Tok Pisin
Udmurt
Kabyle
Tahitian
Sranan
Zealandic
Hill Mari
Komi-Permyak
Lower Sorbian
Abkhazian
Gagauz
Igbo
Oriya
Lao
Kongo
Avar
Moksha
Mirandese
Romani
Old Church Slavonic
Karakalpak
Samoan
Moldovan
Tetum
Gothic
Kashmiri
Bambara
Inupiak
Sindhi
Bislama
Lak
Nauruan
Norfolk
Inuktitut
Pontic
Assamese
Cherokee
Min Dong
Swati
Palatinate German
Hausa
Ewe
Tigrinya
Oromo
Zulu
Zhuang
Venda
Tsonga
Kirundi
Dzongkha
Sango
Cree
Chamorro
Luganda
Buginese
Buryat (Russia)
Fijian
Chichewa
Akan
Sesotho
Xhosa
Fula
Tswana
Kikuyu
Tumbuka
Shona
Twi
Cheyenne
Ndonga
Sichuan Yi
Choctaw
Marshallese
Afar
Kuanyama
Hiri Motu
Muscogee
Kanuri
Herero

Do note: every other selected node is a whitespace-only text node. If you don't want these to be selected, use:

(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()[normalize-space()]
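A sketch of the difference in lxml, on a hypothetical table whose cells have whitespace around the link (the sample markup and variable names are illustrative only):

```python
from lxml import etree

NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample with line breaks and indentation inside each cell.
sample = """
<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td>
    <a href="#en">English</a>
</td></tr>
<tr><td>2</td><td>
    <a href="#de">German</a>
</td></tr>
</table>
</body></html>
""".strip()

root = etree.fromstring(sample)
base = "(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()"

# Without the predicate, whitespace-only text nodes are selected too.
raw = root.xpath(base, namespaces=NS)

# normalize-space() yields an empty string for whitespace-only nodes,
# which is false in a predicate, so those nodes are dropped.
filtered = root.xpath(base + "[normalize-space()]", namespaces=NS)

print(len(raw), filtered)
```

In this sample each cell contributes three text nodes (whitespace, the name, whitespace), and only the names survive the predicate.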