This is the source of the HTML page I want to extract data from.
Web page: http://gbgfotboll.se/information/?scr=table&ftid=51168 (the table is at the bottom of the page).
<html>
  <table class="clCommonGrid" cellspacing="0">
    <thead>
      <tr>
        <td colspan="3">Kommande matcher</td>
      </tr>
      <tr>
        <th style="width:1%;">Tid</th>
        <th style="width:69%;">Match</th>
        <th style="width:30%;">Arena</th>
      </tr>
    </thead>
    <tbody class="clGrid">
      <tr class="clTrOdd">
        <td nowrap="nowrap" class="no-line-through">
          <span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span>
        </td>
        <td><a href="?scr=result&fmid=2669197">Guldhedens IK - IF Warta</a></td>
        <td><a href="?scr=venue&faid=847">Guldheden Södra 1 Konstgräs</a> </td>
      </tr>
      <tr class="clTrEven">
        <td nowrap="nowrap" class="no-line-through">
          <span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span>
        </td>
        <td><a href="?scr=result&fmid=2669176">Romelanda UF - IK Virgo</a></td>
        <td><a href="?scr=venue&faid=941">Romevi 1 Gräs</a> </td>
      </tr>
      <tr class="clTrOdd">
        <td nowrap="nowrap" class="no-line-through">
          <span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span>
        </td>
        <td><a href="?scr=result&fmid=2669167">Kode IF - IK Kongahälla</a></td>
        <td><a href="?scr=venue&faid=912">Kode IP 1 Gräs</a> </td>
      </tr>
      <tr class="clTrEven">
        <td nowrap="nowrap" class="no-line-through">
          <span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span>
        </td>
        <td><a href="?scr=result&fmid=2669147">Floda BoIF - Partille IF FK </a></td>
        <td><a href="?scr=venue&faid=218">Flodala IP 1</a> </td>
      </tr>
    </tbody>
  </table>
</html>
I currently have this code, which does produce the result I want:
import lxml.html

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)

# Hard-coded row count: this is the fragile part the question is about.
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i + 1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i + 1)
    date = html.xpath(xpath1)[0]  # first text node: the date
    time = html.xpath(xpath1)[1]  # second text node: the kick-off time
    teamName = html.xpath(xpath2)[0]
    if date == '2014-09-27':
        print(time, teamName)
It gives this result:
13:00 Romelanda UF - IK Virgo
13:00 Kode IF - IK Kongahälla
14:00 Floda BoIF - Partille IF FK
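Now to the question. I don't want to use a for loop over a fixed range, because it is fragile: the number of rows in the table can change, and the code crashes as soon as the index goes past the last row. So my question is how to iterate safely, meaning the loop visits every row that is actually in the table, no more and no less. Also, if you have any other suggestions for making the code better or faster, please share them.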
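The following code iterates over any number of rows. rows_xpath selects only the rows for the target date, so no filtering is needed inside the loop. The XPath expressions are also compiled once, outside the for loop, so it should be faster.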
import lxml.html
from lxml.etree import XPath

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
date = '2014-09-27'

# Compile the expressions once; rows_xpath keeps only rows for the target date.
rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
time_xpath = XPath("td[1]/span/span//text()[2]")  # second text node: the time
team_xpath = XPath("td[2]/a/text()")

html = lxml.html.parse(url)

# The loop runs once per matching row, however many rows the table has.
for row in rows_xpath(html):
    time = time_xpath(row)[0].strip()
    team = team_xpath(row)[0]
    print(time, team)
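To try the same idea without hitting the live site, here is a minimal offline sketch. It parses a trimmed copy of the table from the question and passes the date as an XPath variable (`$date`) instead of formatting it into the expression string. The `SAMPLE_HTML` string, the `matches_on` helper, and the choice to select the table by its `clCommonGrid` class are assumptions made for illustration. On the real page several tables may share that class, in which case the original `table[3]` path may still be needed.

```python
import lxml.html
from lxml.etree import XPath

# Trimmed copy of the table from the question, so the sketch runs offline.
# The surrounding <div id="content-primary"> is assumed from the XPath above.
SAMPLE_HTML = """
<html><body><div id="content-primary">
<table class="clCommonGrid">
<tbody class="clGrid">
<tr class="clTrOdd">
<td><span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span></td>
<td><a href="?scr=result&amp;fmid=2669197">Guldhedens IK - IF Warta</a></td>
</tr>
<tr class="clTrEven">
<td><span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span></td>
<td><a href="?scr=result&amp;fmid=2669167">Kode IF - IK Kongahälla</a></td>
</tr>
<tr class="clTrOdd">
<td><span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span></td>
<td><a href="?scr=result&amp;fmid=2669147">Floda BoIF - Partille IF FK </a></td>
</tr>
</tbody>
</table>
</div></body></html>
"""

# $date is an XPath variable; its value is supplied as a keyword argument
# when the compiled expression is called.
ROWS = XPath("//table[@class='clCommonGrid']/tbody/tr[td[1]/span/span/text() = $date]")
TIME = XPath("td[1]/span/span/text()[2]")  # text node after the HTML comment
TEAM = XPath("td[2]/a/text()")


def matches_on(doc, date):
    """Return (time, teams) tuples for every row whose date equals `date`."""
    results = []
    for row in ROWS(doc, date=date):
        results.append((TIME(row)[0].strip(), TEAM(row)[0].strip()))
    return results


doc = lxml.html.fromstring(SAMPLE_HTML)
for time, teams in matches_on(doc, "2014-09-27"):
    print(time, teams)
```

Because the loop iterates over whatever node list the expression returns, a date with no matches simply yields an empty list rather than raising an IndexError.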