I installed lxml 2.2.2 on Windows (with Python 2.6.5). I tried this simple command:
from lxml.html import parse
p = parse('http://www.google.com').getroot()
But I get the following error:
Traceback (most recent call last):
  File "", line 1, in
    p=parse('http://www.google.com').getroot()
  File "C:\Python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:49590)
  File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71205)
  File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71488)
  File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70583)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67736)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63820)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64741)
  …

I am new to Python and am using it to work with nltk in my project. After word-tokenizing raw data fetched from a web page, I get a list that contains entries such as '\xe2', '\xe3', '\x98', and so on. I don't need these and want to remove them.
I simply tried
if '\x' in a
and
if a.startswith('\xe')
and both give me an error saying "invalid \x escape".
But when I try a regular expression
re.search('^\\x',a)
I get
Traceback (most recent call last):
  File "<pyshell#83>", line 1, in <module>
    print re.search('^\\x',a)
  File "C:\Python26\lib\re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "C:\Python26\lib\re.py", line 245, in _compile
    raise error, v # invalid expression
error: bogus escape: '\\x'
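Even re.search('^\\x',a) does not recognize it.
I am confused, and even googling did not help (I may be missing something). Please suggest a simple way to remove such strings from the list, and explain what causes the errors above.
Thanks in advance!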
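
A note on the escape errors above, with a minimal sketch. In a Python string literal, \x must be followed by exactly two hex digits, so '\x' and '\xe' are rejected before the code even runs. A token displayed as '\xe2' is a single character whose code is 0xE2; it contains no backslash and no letter x, so searching for a literal "\x" cannot find it. One possible approach, assuming the tokens are ordinary str values and that dropping every token containing a non-ASCII character is acceptable, is to filter on character codes. The names is_ascii, tokens, and cleaned are hypothetical and not from the question.

```python
def is_ascii(token):
    """Return True if every character in token has a code below 128.

    Assumes token is a str (a byte string on Python 2, text on Python 3).
    """
    return all(ord(ch) < 128 for ch in token)


# Hypothetical sample data standing in for the tokenizer output.
tokens = ['caf', '\xe2', '\x98', 'hello', '\xe3']

# '\xe2' is one character, not a backslash followed by 'xe2'.
single_char_length = len('\xe2')
has_backslash = '\\' in '\xe2'

# Keep only the tokens made entirely of ASCII characters.
cleaned = [t for t in tokens if is_ascii(t)]
```

With the sample list above, cleaned keeps only 'caf' and 'hello'. If the page text is meant to be preserved rather than discarded, decoding the raw bytes with the page's actual encoding before tokenizing may be the better route; this sketch only covers removal.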