Ana*_*sev 1 python parsing lxml
I need to parse an HTML table with the following structure:
<table class="table1" width="620" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr width="620">
<th width="620">Smth1</th>
...
</tr>
<tr bgcolor="ffffff" width="620">
<td width="620">Smth2</td>
...
</tr>
<tr bgcolor="E4E4E4" width="620">
<td width="620">Smth3</td>
...
</tr>
<tr bgcolor="ffffff" width="620">
<td width="620">Smth4</td>
...
</tr>
</tbody>
</table>
Python code:
r = requests.post(url,data)
html = lxml.html.document_fromstring(r.text)
rows = html.xpath(xpath1)[0].findall("tr")
#Getting Xpath with FireBug
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])
But on the third line I get this:
IndexError: list index out of range
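The task is to build a Python dict from the table. The number of rows may vary.
UPD. I changed the way I fetch the HTML to rule out possible problems with the requests library. Now it is a plain URL: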
html = lxml.html.parse(test_url)
This confirms that the HTML itself is fine:
lxml.html.open_in_browser(html)
But the problem is still the same:
rows = html.xpath(xpath1)[0].findall('tr')
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])
Here is xpath1:
'/html/body/table/tbody/tr[5]/td/table/tbody/tr/td[2]/table/tbody/tr/td/center/table'
UPD2. By experimenting I found the point where the XPath breaks:
xpath1 = '/html/body/table/tbody'
print(html.xpath(xpath1))
# prints []
With a shorter xpath1 it seems to work: xpath1 = '/html/body/table' returns [<Element table at 0x2cbadb0>].
You did not include the XPath, so I am not sure what you are trying to do, but if I understand correctly, this should work:
xpath1 = "tbody/tr"
r = requests.post(url,data)
html = lxml.html.fromstring(r.text)
rows = html.xpath(xpath1)
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])
This produces a list of one-item lists, like this:
[['Smth1'], ['Smth2'], ['Smth3'], ['Smth4']]
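Since the question asks for a Python dict, here is a minimal sketch of one way to turn such rows into dicts. It assumes the first row holds the `<th>` headers and every later row has one `<td>` per header. The second column (`Other1` and so on) and the name `SAMPLE_HTML` are invented for illustration, because the original table elides its remaining cells.

```python
import lxml.html

# Hypothetical sample modelled on the table in the question; the second
# column is invented because the original elides the remaining cells.
SAMPLE_HTML = """<table class="table1">
<tbody>
<tr><th>Smth1</th><th>Other1</th></tr>
<tr bgcolor="ffffff"><td>Smth2</td><td>Other2</td></tr>
<tr bgcolor="E4E4E4"><td>Smth3</td><td>Other3</td></tr>
</tbody>
</table>"""

# For a single-element fragment, fromstring() returns the <table> element itself.
table = lxml.html.fromstring(SAMPLE_HTML)
rows = table.xpath("tbody/tr")

# text_content() also collects text nested inside tags such as <b> or <a>.
cells = [[c.text_content().strip() for c in row] for row in rows]

# First row is treated as the header, every other row becomes one dict.
header, body = cells[0], cells[1:]
records = [dict(zip(header, values)) for values in body]
print(records)
```

Because the rows are handled in a loop, the number of rows does not matter. If the real table has a different layout (for example, keys in the first column), the `zip` step would need to change accordingly.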
To get a flat list of values, you can use this code:
xpath1 = "tbody/tr/*/text()"
r = requests.post(url,data)
html = lxml.html.fromstring(r.text)
data = html.xpath(xpath1)
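The flat-list query can be tried without any network request. The sketch below runs it on a trimmed copy of the table from the question, with the elided cells dropped; `SAMPLE_HTML` is an illustrative name, not something from the original code.

```python
import lxml.html

# Trimmed copy of the table from the question (elided cells left out).
SAMPLE_HTML = """<table class="table1">
<tbody>
<tr><th>Smth1</th></tr>
<tr><td>Smth2</td></tr>
<tr><td>Smth3</td></tr>
<tr><td>Smth4</td></tr>
</tbody>
</table>"""

table = lxml.html.fromstring(SAMPLE_HTML)

# text() selects the text nodes directly inside each <th>/<td>,
# so the result is already a flat list of strings.
flat = table.xpath("tbody/tr/*/text()")
print(flat)
```

Note that `text()` only sees text placed directly inside a cell. Text wrapped in further tags inside the cell would need a different query.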
This all assumes that r.text is exactly what you posted there.
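A possible explanation for the empty result in UPD2, which the thread itself does not confirm: browser tools such as Firebug show the DOM after the browser has inserted `<tbody>` elements that are absent from the page source, while lxml's default HTML parser keeps the tree as written. An XPath copied from the browser can then contain `tbody` steps that match nothing, and indexing the empty result with `[0]` raises `IndexError`. The sketch below uses a made-up page (`PAGE`) without `<tbody>` to show the effect and one way around it.

```python
import lxml.html

# Hypothetical page source: note that there is no <tbody> in the markup.
PAGE = """<html><body>
<table class="table1">
<tr><th>Smth1</th></tr>
<tr><td>Smth2</td></tr>
</table>
</body></html>"""

doc = lxml.html.document_fromstring(PAGE)

# The browser-style path finds nothing, because lxml did not add <tbody>.
with_tbody = doc.xpath("/html/body/table/tbody/tr")
print(with_tbody)

# Check for an empty match instead of indexing [0] blindly.
tables = doc.xpath("/html/body/table")
if not tables:
    raise SystemExit("table not found; check the XPath against the raw source")

# "//tr" below the table matches rows with or without an intervening <tbody>.
rows = doc.xpath("/html/body/table//tr")
data = [[c.text for c in row] for row in rows]
print(data)
```

One caveat: `//tr` also matches rows of tables nested inside the target table. For a page with nested tables like the one in the question, a stricter relative query such as `tr | tbody/tr` on the table element may be a better fit.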