Related solutions (0)

Error in the parse function in lxml

I have installed lxml 2.2.2 on Windows (with Python 2.6.5). I tried this simple command:

from lxml.html import parse
p = parse('http://www.google.com').getroot()

But I get the following error:

Traceback (most recent call last):
  File "", line 1, in
    p=parse('http://www.google.com').getroot()
  File "C:\Python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:49590)
  File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71205)
  File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71488)
  File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70583)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67736)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63820)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64741) …
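
The traceback is cut off before the final error message, so the actual cause is not visible here. What it does show is that lxml itself tried to load the URL (the URL-loading frame under `etree.parse`). One thing that may be worth trying is to fetch the page with the standard library and hand lxml a file-like object instead of a URL string. The sketch below is written for Python 3 (the question uses Python 2.6), and `open_html_stream` is a hypothetical helper name, not part of lxml.

```python
import io
import urllib.request


def open_html_stream(url, timeout=10.0):
    """Fetch `url` with urllib and return its body as a seekable binary stream.

    Hypothetical helper: the returned object is meant to be passed to
    lxml.html.parse() in place of the URL string, so that lxml never has
    to open the network connection itself.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Read the whole body into memory so the stream outlives the connection.
        return io.BytesIO(response.read())
```

With lxml installed, the original call would then become `parse(open_html_stream('http://www.google.com')).getroot()`, since `lxml.html.parse` accepts a file-like object as well as a filename or URL. Whether this avoids the error above depends on the cause hidden by the truncated traceback.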

python windows parsing lxml

12
Votes
1
Answers
9622
Views

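HTML encoding and lxml parsing

I am trying to finally resolve some encoding issues that pop up when scraping HTML with lxml. Here are three sample HTML documents I have run into: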

1.

<!DOCTYPE html>
<html lang='en'>
<head>
   <title>Unicode Chars: ? —’</title>
   <meta charset='utf-8'>
</head>
<body></body>
</html>

2.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
    <title>Unicode Chars: ? —’</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>

3.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Unicode Chars: ? —’</title>
</head>
<body></body>
</html>

My basic script:

from lxml.html import fromstring
...

doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print(title)  # parenthesized form runs on both Python 2 and Python 3

The results are: …
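
The three sample documents declare UTF-8 in three different ways: a `<meta charset>` tag, a `<meta http-equiv>` content type, and an XML declaration. As a rough illustration of how that declared encoding could be read from the raw bytes before parsing, here is a minimal standard-library sketch in Python 3. It assumes `raw_html` is a byte string; `declared_encoding` and the two regular expressions are hypothetical names for this example, and this is far simpler than the sniffing that browsers or lxml perform.

```python
import codecs
import re

# One pattern per declaration style seen in the sample documents.
_XML_DECL = re.compile(
    rb'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._:-]+)["\']', re.I)
# Covers both <meta charset='...'> and <meta ... content="...; charset=...">.
_META_CHARSET = re.compile(
    rb'<meta[^>]*\bcharset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.I)


def declared_encoding(raw_html, default="utf-8"):
    """Return the encoding that a byte string of HTML declares for itself.

    Only the first 4096 bytes are inspected. If nothing usable is found,
    `default` is returned unchanged.
    """
    head = raw_html[:4096]
    for pattern in (_XML_DECL, _META_CHARSET):
        match = pattern.search(head)
        if match:
            name = match.group(1).decode("ascii")
            try:
                # Normalize aliases such as "UTF8" to Python's codec name.
                return codecs.lookup(name).name
            except LookupError:
                continue  # unknown label, try the next declaration style
    return default
```

The detected name could then be given to lxml as `lxml.html.HTMLParser(encoding=...)`, with that parser passed to `fromstring(raw_html, parser=...)`, so that lxml decodes the bytes itself. Because the result in the question is truncated, this is offered as something to try, not as a confirmed fix.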

python unicode lxml beautifulsoup web-scraping

9
Votes
1
Answers
9995
Views
