tom*_*ith 12 python lxml element elementtree
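The test below reads a file and uses lxml.html to list the leaf nodes of the page's DOM/graph.
However, I'm also trying to figure out how to take the input from a "string". Using
lxml.html.fromstring(s)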
doesn't work, because it produces an "Element" rather than an "ElementTree".
So I'm trying to figure out how to convert an Element into an ElementTree.
Thoughts?
import subprocess
import lxml.html
from lxml import etree  # trying this to see if needed
                        # to convert from element to elementtree

#cmd='cat osu_test.txt'
cmd = 'cat o2.txt'
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
s = proc.communicate()[0].strip()
# s contains HTML not XML text

#doc = lxml.html.parse(s)
doc = lxml.html.parse('osu_test.txt')
doc1 = lxml.html.fromstring(s)

for node in doc.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc.getpath(node)
        #print "aaa ",node.tag

nt = etree.ElementTree(doc1)  # <<<<< doesn't work.. so what will??
for node in nt.iter():
    if len(node) == 0:
        # note: node comes from nt here, but getpath is called on doc
        print "aaa ", node.tag, doc.getpath(node)
        #print "aaa ",node.tag
===============================
UPDATE:::
(Parsing HTML, not XML.) Added the change suggested by Abbas. Got the following error:
doc1 = etree.fromstring(s)
File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48621)
File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72232)
File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71093)
File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67862)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64244)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65165)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64508)
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 48, column 220
UPDATE :::
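Managed to get the test working. I'm not sure why. If someone with Python chops wants to provide an explanation, it would help future people who stumble on this.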
from cStringIO import StringIO
from lxml.html import parse

doc1 = parse(StringIO(s))
for node in doc1.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc1.getpath(node)
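It seems the StringIO module/class implements the IO functionality that parse needs in order to process the input HTML string. Perhaps similar to the casting that other languages provide...
Thanks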
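A possible explanation, offered as a sketch rather than a definitive answer: lxml.html.parse expects a filename, URL, or file-like object and returns an ElementTree, while lxml.html.fromstring expects the markup itself and returns an Element. Wrapping the string in StringIO makes it look like a file, which is why parse accepts it. The example below assumes Python 3 (so io.StringIO stands in for cStringIO), and both the sample HTML and the helper name leaf_paths are made up for illustration.

```python
from io import StringIO  # Python 3 stand-in for cStringIO.StringIO

import lxml.html

# Hypothetical sample input; the original reads its HTML from a file.
s = "<html><body><p>Hello&nbsp;world</p><div><span>leaf</span></div></body></html>"

# parse() takes a filename, URL, or file-like object and returns an ElementTree.
tree = lxml.html.parse(StringIO(s))

# fromstring() takes the markup itself and returns an Element.
element = lxml.html.fromstring(s)


def leaf_paths(tree):
    """Return (tag, path) pairs for every element that has no child elements."""
    return [(node.tag, tree.getpath(node)) for node in tree.iter() if len(node) == 0]


paths = leaf_paths(tree)
for tag, path in paths:
    print("aaa ", tag, path)
```

Passing the bare string to parse would instead be treated as a filename or URL, which is likely why the commented-out lxml.html.parse(s) attempt in the question did not behave as hoped.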
To get the root tree from an _Element (generated with lxml.html.fromstring), you can use the getroottree method:
doc = lxml.html.fromstring(s)
tree = doc.getroottree()
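As a minimal sketch of the getroottree approach, assuming Python 3 and a made-up HTML string (the question reads its input from a file instead), the tree returned by getroottree can be used for both iteration and getpath:

```python
import lxml.html

# Hypothetical input; any HTML string would do.
s = "<html><body><p>one</p><p>two</p></body></html>"

root = lxml.html.fromstring(s)  # an Element (HtmlElement)
tree = root.getroottree()       # the ElementTree that owns that element

# Collect the leaf nodes, asking the same tree for each node's path.
leaves = [(node.tag, tree.getpath(node)) for node in tree.iter() if len(node) == 0]
for tag, path in leaves:
    print("aaa ", tag, path)
```

The key point is that getpath is called on the same tree the nodes were taken from.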
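The etree.fromstring method parses an XML string and returns the root element. The etree.ElementTree class is a tree wrapper around an element, so it requires an element for instantiation.
So passing the root element to the etree.ElementTree() constructor should give you what you need: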
root = etree.fromstring(s)
nt = etree.ElementTree(root)
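Since the input in the question is HTML rather than XML, etree.fromstring can fail on entities such as &nbsp;, as the traceback in the question shows. A hedged variant, assuming Python 3 and a made-up HTML string, parses with lxml.html.fromstring and then wraps the resulting element in etree.ElementTree:

```python
from lxml import etree
import lxml.html

# Hypothetical HTML input containing an HTML-only entity.
s = "<html><body><p>a&nbsp;b</p></body></html>"

# The XML parser does not know HTML entities such as &nbsp;.
try:
    etree.fromstring(s)
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = exc

# An HTML-aware parser accepts the same text; wrap its root in an ElementTree.
root = lxml.html.fromstring(s)
nt = etree.ElementTree(root)

leaf_tags = []
for node in nt.iter():
    if len(node) == 0:
        leaf_tags.append(node.tag)
        print("aaa ", node.tag, nt.getpath(node))
```

Here the wrapping step is the same as in the answer above; only the parser that produces the root element differs.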