Is there a clean way to get the n-th column of an HTML table using BeautifulSoup?

Question

Ben*_*hoo 5 python html-table beautifulsoup

Suppose we are looking at the first table on the page, so:

table = BeautifulSoup(...).table

The rows can be scanned with a clean for loop:

for row in table.find_all('tr'):
    f(row)

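But getting a column is where things get messy.

My question: is there an elegant way to extract a single column, either by its position or by its "name" (i.e. the text that appears in the first row of that column)?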

Answer 1

Chr*_*ell 5

lxml is many times faster than BeautifulSoup, so you may want to use it instead.

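from lxml.html import parse

# cssselect() requires the separate cssselect package to be installed.
doc = parse('http://python.org').getroot()
for row in doc.cssselect('table > tr'):
    # nth-child is 1-based, so this selects the third cell of each row
    for cell in row.cssselect('td:nth-child(3)'):
        print(cell.text_content())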

Or, instead of nested loops:

rows = doc.cssselect('table > tr')
# cssselect() is a method of each element, not of the list, so iterate over the rows
cells = [cell.text_content() for row in rows for cell in row.cssselect('td:nth-child(3)')]
print(cells)
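
For readers who want to stay with BeautifulSoup, as the question asks, the same idea can be sketched as follows. This is a minimal illustration, not part of the original answer: it assumes the bs4 package with Python's built-in html.parser backend, and the sample HTML and the helper names column_by_index and column_by_name are hypothetical. It also assumes a simple table with no colspan/rowspan and no nested tables.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table, used only to make the sketch self-contained.
HTML = """
<table>
  <tr><th>Name</th><th>Language</th><th>Year</th></tr>
  <tr><td>Guido</td><td>Python</td><td>1991</td></tr>
  <tr><td>Larry</td><td>Perl</td><td>1987</td></tr>
</table>
"""


def column_by_index(table, n):
    """Return the text of the n-th cell (0-based) of every row that has one."""
    column = []
    for row in table.find_all("tr"):
        # Only direct children of the row count as its cells.
        cells = row.find_all(["th", "td"], recursive=False)
        if n < len(cells):
            column.append(cells[n].get_text(strip=True))
    return column


def column_by_name(table, name):
    """Return the column whose first-row text equals `name`, header excluded."""
    first_row = table.find("tr")
    headers = [cell.get_text(strip=True)
               for cell in first_row.find_all(["th", "td"], recursive=False)]
    # list.index raises ValueError if no column carries that name.
    return column_by_index(table, headers.index(name))[1:]


table = BeautifulSoup(HTML, "html.parser").table
print(column_by_index(table, 1))
print(column_by_name(table, "Year"))
```

Note that the index here is 0-based, unlike the 1-based nth-child selector used in the lxml version. Rows with fewer than n + 1 cells are skipped rather than padded, which may or may not be what you want for ragged tables.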
