When running a Python script, I got this error:
from lxml import etree
ImportError: No module named lxml
So I tried to install lxml:
sudo easy_install lxml
But it gave me the following error:
Building lxml version 2.3.beta1.
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
ERROR: /bin/sh: xslt-config: not found
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
src/lxml/lxml.etree.c:4: fatal error: Python.h: No such file or directory
compilation terminated.
error: Setup script exited with error: command 'gcc' failed with exit status 1
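I am having trouble installing lxml. I have tried the solutions from related questions on this site and elsewhere, but could not resolve the problem. I need some suggestions or a solution.
Here is the full log after running pip install lxml: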
Downloading/unpacking lxml
Downloading lxml-3.3.5.tar.gz (3.5MB): 3.5MB downloaded
Running setup.py (path:/tmp/pip_build_root/lxml/setup.py) egg_info for package lxml
/usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
Building lxml version 3.3.5.
Building without Cython.
Using build configuration of libxslt 1.1.28
warning: no previously-included files found matching '*.py'
Installing collected packages: lxml
Running setup.py install for lxml
/usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
Building lxml version 3.3.5.
Building without Cython.
Using build configuration of libxslt 1.1.28
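building 'lxml.etree' …
I'd like to write a code snippet that grabs all of the text inside the <content> tag, using lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()), but that misses the text in between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?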
<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"
<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"
<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"
I'm using lxml.html to generate some HTML. I want to pretty-print (with indentation) my final result to an HTML file. How do I do that?
This is what I've tried so far (I'm relatively new to Python and lxml):
import lxml.html as lh
from lxml.html import builder as E
sliderRoot=lh.Element("div", E.CLASS("scroll"), style="overflow-x: hidden; overflow-y: hidden;")
scrollContainer=lh.Element("div", E.CLASS("scrollContainer"), style="width: 4340px;")
sliderRoot.append(scrollContainer)
print lh.tostring(sliderRoot, pretty_print = True, method="html")
As you can see, I'm using the pretty_print=True argument. I thought it would give indented code, but it doesn't really help. This is the output:
<div style="overflow-x: hidden; overflow-y: hidden;" class="scroll"><div style="width: 4340px;" class="scrollContainer"></div></div>
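A possible workaround, sketched under assumptions: insert the indentation whitespace into the tree explicitly with etree.indent() before serializing. This assumes lxml 4.5 or later (or Python 3.9 or later for the standard-library fallback, which is included only so the snippet runs without lxml). It builds the elements with lxml.etree rather than lxml.html, which is a change from the code in the question.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, needs Python 3.9+ for indent()
    import xml.etree.ElementTree as etree

# Same two nested <div> elements as in the question.
slider_root = etree.Element(
    "div", {"class": "scroll", "style": "overflow-x: hidden; overflow-y: hidden;"}
)
scroll_container = etree.SubElement(
    slider_root, "div", {"class": "scrollContainer", "style": "width: 4340px;"}
)

# indent() modifies the tree in place, adding newlines and spaces
# as text/tail whitespace so the serialized output is indented.
etree.indent(slider_root, space="  ")
pretty = etree.tostring(slider_root, encoding="unicode", method="html")
print(pretty)
```

The resulting string can then be written to an .html file with an ordinary open(...).write(pretty) call. Note that whitespace added this way becomes part of the document, which can matter for inline HTML elements.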
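As far as I can tell, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I chose BeautifulSoup for a project I'm working on, but for no particular reason other than finding its syntax a bit easier to learn and understand. However, I see that a lot of people seem to favour lxml, and I've heard that lxml is faster.
So I'm wondering: what are the advantages of one over the other? When would I want to use lxml, and when would I be better off using BeautifulSoup? Are there any other libraries worth considering?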
I have an XML document that I'm trying to parse using lxml.etree:
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
My code is:
path = "path to xml file"
from lxml import etree as ET
parser = ET.XMLParser(ns_clean=True)
dom = ET.parse(path, parser)
dom.getroot()
When I try to get dom.getroot(), I get:
<Element {http://www.example.com/zzz/yyy}Envelope at 28adacac>
However, I only want:
<Element Envelope at 28adacac>
When I do
dom.getroot().find("Body")
I get nothing back. However, when I do
dom.getroot().find("{http://www.example.com/zzz/yyy}Body")
I get a result.
I thought passing ns_clean=True to the parser would prevent this.
Any ideas?
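For context, ns_clean only removes redundant namespace declarations; it does not take elements out of their namespace, so the qualified tag name is expected. A minimal sketch of one common approach follows: pass a prefix-to-URI mapping to find(). The prefix "e" is an arbitrary choice made for this example, and the standard-library fallback import is an assumption added so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same API for the calls used here
    import xml.etree.ElementTree as etree

XML = """<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>"""

# The prefix "e" is arbitrary; only the URI has to match the document.
NS = {"e": "http://www.example.com/zzz/yyy"}

root = etree.fromstring(XML)

unqualified = root.find("Body")           # no match: Body lives in a namespace
body = root.find("e:Body", namespaces=NS)  # match via the prefix mapping

# Strip the "{uri}" part if only the local tag name is wanted for display.
local_name = root.tag.split("}", 1)[-1]

print(unqualified, body.text, local_name)
```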
<?xml version="1.0" ?>
<data>
  <test>
    <f1 />
  </test>
  <test2>
    <test3>
      <f1 />
    </test3>
  </test2>
  <f1 />
</data>
Is it possible to find the tag "f1" recursively using lxml? I tried the findall method, but it only works for immediate children.
Maybe I should go for BeautifulSoup for this!
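A short sketch of two ways to search the whole subtree: findall() with a ".//" path prefix, and iter() with a tag name. The sample document is the one from the question with the XML declaration left out; the standard-library fallback import is an assumption added so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same API for the calls used here
    import xml.etree.ElementTree as etree

XML = """<data>
  <test>
    <f1 />
  </test>
  <test2>
    <test3>
      <f1 />
    </test3>
  </test2>
  <f1 />
</data>"""

root = etree.fromstring(XML)

direct_children = root.findall("f1")   # only immediate children of <data>
all_by_path = root.findall(".//f1")    # ".//" searches all descendants
all_by_iter = list(root.iter("f1"))    # iter() walks the whole subtree

print(len(direct_children), len(all_by_path), len(all_by_iter))
```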
I'm creating an XML file from scratch using lxml, with code like this:
from lxml import etree
root = etree.Element("root")
root.set("interesting", "somewhat")
child1 = etree.SubElement(root, "test")
How do I write the root Element object to an XML file using the write() method of the ElementTree class?
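A minimal sketch of one way to do this: wrap the root element in an ElementTree and call write() on it. The output file name and the temporary directory are hypothetical choices for the example, and the standard-library fallback import is an assumption added so the snippet runs without lxml.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same API for the calls used here
    import xml.etree.ElementTree as etree

root = etree.Element("root")
root.set("interesting", "somewhat")
child1 = etree.SubElement(root, "test")

# write() is a method of ElementTree, so wrap the root element first.
tree = etree.ElementTree(root)

# Hypothetical output location, chosen only for this example.
out_path = os.path.join(tempfile.mkdtemp(), "output.xml")
tree.write(out_path, encoding="utf-8", xml_declaration=True)

# Read the file back to check what was written.
reloaded = etree.parse(out_path).getroot()
print(reloaded.tag, reloaded.get("interesting"))
```

With lxml specifically, write() also accepts pretty_print=True for indented output.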
I need to parse an XML file to extract some data. I only need some elements with certain attributes. Here is an example document:
<root>
  <articles>
    <article type="news">
      <content>some text</content>
    </article>
    <article type="info">
      <content>some text</content>
    </article>
    <article type="news">
      <content>some text</content>
    </article>
  </articles>
</root>
Here I would like to get only the articles of type "news". What is the most efficient and elegant way to do it with lxml?
I tried with the find method, but it's not very nice:
from lxml import etree
f = etree.parse("myfile")
root = f.getroot()
articles = root.getchildren()[0]
article_list = articles.findall('article')
for article in article_list:
    if "type" in article.keys():
        if article.attrib['type'] == 'news':
            content = article.find('content')
            content = content.text
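A more compact alternative, as a sketch: put the attribute test into the path expression as a predicate, so that findall() returns only the matching articles. The sample document is the one from the question; the standard-library fallback import is an assumption added so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same API for the calls used here
    import xml.etree.ElementTree as etree

XML = """<root>
  <articles>
    <article type="news"><content>some text</content></article>
    <article type="info"><content>some text</content></article>
    <article type="news"><content>some text</content></article>
  </articles>
</root>"""

root = etree.fromstring(XML)

# The [@type='news'] predicate keeps only articles with that attribute value.
news_articles = root.findall(".//article[@type='news']")

# findtext() returns the text of the first matching child element.
news_contents = [article.findtext("content") for article in news_articles]
print(news_contents)
```

With lxml, the same predicate can also be used in a full XPath query through root.xpath().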
I want to use an XPath expression to get the value of an attribute.
I expected the following to work:
from lxml import etree
for customer in etree.parse('file.xml').getroot().findall('BOB'):
    print customer.find('./@NAME')
But this gives an error:
Traceback (most recent call last):
  File "bob.py", line 22, in <module>
    print customer.find('./@ID')
  File "lxml.etree.pyx", line 1409, in lxml.etree._Element.find (src/lxml/lxml.etree.c:39972)
  File "/usr/local/lib/python2.7/dist-packages/lxml/_elementpath.py", line 272, in find
    it = iterfind(elem, path, namespaces)
  File "/usr/local/lib/python2.7/dist-packages/lxml/_elementpath.py", line 262, in iterfind
    selector = _build_path_iterator(path, namespaces)
  File "/usr/local/lib/python2.7/dist-packages/lxml/_elementpath.py", line 246, in _build_path_iterator
    selector.append(ops[token[0]](_next, token))
KeyError: '@'
Am I wrong to expect this to work?
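For context, find() and findall() use the limited ElementPath syntax, which selects elements and cannot return an attribute value as the result of a path. A sketch of two alternatives follows: read the attribute with get(), or, with lxml, run a full XPath query through xpath(). The sample document and its NAME and ID values are hypothetical, since the question does not show file.xml, and the standard-library fallback import is an assumption added so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same API for get()/findall()
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for file.xml, which is not shown in the question.
XML = """<customers>
  <BOB NAME="Alice" ID="1"/>
  <BOB NAME="Carol" ID="2"/>
</customers>"""

root = etree.fromstring(XML)

# Option 1: select the elements, then read the attribute with get().
names = [customer.get("NAME") for customer in root.findall("BOB")]
print(names)

# Option 2 (lxml only): xpath() supports attribute steps such as @NAME
# and returns a list of the matching attribute values.
if hasattr(root, "xpath"):
    xpath_names = root.xpath("BOB/@NAME")
    print(xpath_names)
```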