Iterate through all the rows in a table using python lxml xpath

Question


6 python xpath lxml html-table web-scraping

This is the source code of the HTML page I want to extract data from.

Web page: http://gbgfotboll.se/information/?scr=table&ftid=51168 (the table is at the bottom of the page)

     <html>
               <table class="clCommonGrid" cellspacing="0">
                        <thead>
                            <tr>
                                <td colspan="3">Kommande matcher</td>
                            </tr>
                            <tr>
                                <th style="width:1%;">Tid</th>
                                <th style="width:69%;">Match</th>
                                <th style="width:30%;">Arena</th>
                            </tr>
                        </thead>

                        <tbody class="clGrid">

                    <tr class="clTrOdd">
                        <td nowrap="nowrap" class="no-line-through">
                            <span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span>



                        </td>
                        <td><a href="?scr=result&amp;fmid=2669197">Guldhedens IK - IF Warta</a></td>
                        <td><a href="?scr=venue&amp;faid=847">Guldheden Södra 1 Konstgräs</a> </td>
                    </tr>

                    <tr class="clTrEven">
                        <td nowrap="nowrap" class="no-line-through">
                            <span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span>



                        </td>
                        <td><a href="?scr=result&amp;fmid=2669176">Romelanda UF - IK Virgo</a></td>
                        <td><a href="?scr=venue&amp;faid=941">Romevi 1 Gräs</a> </td>
                    </tr>

                    <tr class="clTrOdd">
                    <td nowrap="nowrap" class="no-line-through">
                        <span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span>



                    </td>
                    <td><a href="?scr=result&amp;fmid=2669167">Kode IF - IK Kongahälla</a></td>
                    <td><a href="?scr=venue&amp;faid=912">Kode IP 1 Gräs</a> </td>
                </tr>

                <tr class="clTrEven">
                    <td nowrap="nowrap" class="no-line-through">
                        <span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span>



                    </td>
                    <td><a href="?scr=result&amp;fmid=2669147">Floda BoIF - Partille IF FK </a></td>
                    <td><a href="?scr=venue&amp;faid=218">Flodala IP 1</a> </td>
                </tr>


                        </tbody>
                </table>
        </html>

Now I have this code, which actually produces the result I want:

import lxml.html
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" %(i+1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" %(i+1)
    time = html.xpath(xpath1)[1]
    date = html.xpath(xpath1)[0]
    teamName = html.xpath(xpath2)[0]
    if date == '2014-09-27':
        print time, teamName

It gives this result:

13:00 Romelanda UF - IK Virgo

13:00 Kode IF - IK Kongahälla

14:00 Floda BoIF - Partille IF FK

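Now to the question. I don't want to use a for loop over a fixed range, because it is not stable: the number of rows in that table can change, and if the loop goes out of range it will crash. So my question is how to iterate in a safe way, meaning it loops through all the rows available in the table, no more and no less. Also, if you have any other suggestions for making the code better or faster, please go ahead.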

Answer 1

Geo*_*tin 6

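The following code will iterate over any number of rows. rows_xpath filters directly on the target date. The XPath expressions are also compiled once, outside the for loop, so it should be faster.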

import lxml.html
from lxml.etree import XPath
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
date = '2014-09-27'

rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
time_xpath = XPath("td[1]/span/span//text()[2]")
team_xpath = XPath("td[2]/a/text()")

html = lxml.html.parse(url)

for row in rows_xpath(html):
    time = time_xpath(row)[0].strip()
    team = team_xpath(row)[0]
    print time, team

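For readers on Python 3, here is a minimal sketch of the same row-by-row approach. It is not the answerer's code. It runs against a small inline HTML sample rather than the live page, so the SAMPLE string, the wrapping div, and the matches_on helper are assumptions made for illustration. It also passes the date as an XPath variable ($date) instead of formatting it into the expression string.

```python
import lxml.html
from lxml.etree import XPath

# Hypothetical inline sample that mirrors the table in the question,
# so the sketch runs without fetching the live page.
SAMPLE = """<html><body><div id="content-primary">
<table class="clCommonGrid">
  <thead><tr><th>Tid</th><th>Match</th><th>Arena</th></tr></thead>
  <tbody class="clGrid">
    <tr class="clTrOdd">
      <td><span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669197">Guldhedens IK - IF Warta</a></td>
    </tr>
    <tr class="clTrEven">
      <td><span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669167">Kode IF - IK Kongahälla</a></td>
    </tr>
    <tr class="clTrOdd">
      <td><span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669147">Floda BoIF - Partille IF FK</a></td>
    </tr>
  </tbody>
</table>
</div></body></html>"""

# Compile the expressions once. $date is bound at call time.
rows_xpath = XPath("//*[@id='content-primary']//tbody/tr[td[1]//text() = $date]")
time_xpath = XPath("td[1]//text()")
team_xpath = XPath("string(td[2]/a)")


def matches_on(doc, date):
    """Return (time, team) pairs for every row whose first cell contains `date`."""
    results = []
    # The loop visits exactly the rows the XPath matched, however many there are.
    for row in rows_xpath(doc, date=date):
        # Drop whitespace-only text nodes; what remains is [date, time].
        parts = [t.strip() for t in time_xpath(row) if t.strip()]
        time = parts[1] if len(parts) > 1 else ""
        results.append((time, team_xpath(row).strip()))
    return results


doc = lxml.html.fromstring(SAMPLE)
for time, team in matches_on(doc, "2014-09-27"):
    print(time, team)
```

Because the loop is driven by the list of matched rows, a date with no matches simply yields an empty list instead of raising an IndexError.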

Archived: 11 years, 5 months ago
Views: 5956
Last activity: 8 years, 11 months ago