How do I get all rows from a specific table using BeautifulSoup?

Question

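I am learning Python and BeautifulSoup to scrape data from the web, and I need to read an HTML table. I can open the page in Open Office, which reports the table as table #11.

BeautifulSoup seems to be the preferred choice, but can anyone tell me how to grab a specific table and all of its rows? I have looked at the module documentation but can't make sense of it. Many of the examples I have found online seem to do more than I need.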

Answer 1

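If you have a chunk of HTML to parse with BeautifulSoup, this should be fairly straightforward. The general idea is to navigate to your table with the findChildren method, then read the text inside each cell through its string property.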

>>> from bs4 import BeautifulSoup
>>>
>>> html = """
... <html>
... <body>
...     <table>
...         <th><td>column 1</td><td>column 2</td></th>
...         <tr><td>value 1</td><td>value 2</td></tr>
...     </table>
... </body>
... </html>
... """
>>>
>>> # Name the parser explicitly; html.parser ships with Python
>>> soup = BeautifulSoup(html, 'html.parser')
>>> # findChildren is the older alias of find_all
>>> tables = soup.findChildren('table')
>>>
>>> # This will get the first (and only) table. Your page may have more.
>>> my_table = tables[0]
>>>
>>> # You can find children with multiple tags by passing a list of strings
>>> rows = my_table.findChildren(['th', 'tr'])
>>>
>>> for row in rows:
...     cells = row.findChildren('td')
...     for cell in cells:
...         value = cell.string
...         print("The value in this cell is %s" % value)
...
The value in this cell is column 1
The value in this cell is column 2
The value in this cell is value 1
The value in this cell is value 2
>>>


Yes, `.findChildren(['th','tr'])` searches for elements whose tag is `th` or `tr`. If you only want the `tr` elements, use `.findChildren('tr')` (note: a plain string, not a list). (2 upvotes)
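For the "table #11" case in the question, the same idea can be wrapped in a small function. This is only a sketch: `table_rows` and the sample markup are made up for illustration, it uses the bs4 package with the built-in `html.parser`, and it assumes that Open Office counts tables in document order starting at 1, which may not hold for every page.

```python
from bs4 import BeautifulSoup


def table_rows(html, table_index):
    """Return the rows of the table at table_index as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")
    table = tables[table_index]  # raises IndexError if the page has fewer tables
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (th) and data cells (td) are both collected
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


sample = """
<table><tr><td>first table</td></tr></table>
<table>
  <tr><th>column 1</th><th>column 2</th></tr>
  <tr><td>value 1</td><td>value 2</td></tr>
</table>
"""

# Index 1 is the second table; a 1-based "table #11" would be index 10
print(table_rows(sample, 1))
```

If the index does not land on the table you expect, printing `len(soup.find_all("table"))` and the first row of each table is a quick way to find the right one.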

Answer 2

And*_*kha 5

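If the page has nested tables (as on old-style, layout-driven sites), the approach above may fail, because the search is recursive and also returns the rows and cells of the inner tables.

As a workaround, you may want to extract the non-nested tables first, that is, the tables that contain no other table: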

html = '''<table>
<tr>
<td>Top level table cell</td>
<td>
    <table>
    <tr><td>Nested table cell</td></tr>
    <tr><td>...another nested cell</td></tr>
    </table>
</td>
</tr>
</table>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
non_nested_tables = [t for t in soup.find_all('table') if not t.find_all('table')]

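To make clear which tables that filter keeps, here is a small sketch that compares it with the opposite selection (tables that are not inside any other table). The `id` attributes and the variable names `innermost` and `outermost` are added purely for illustration, and `html.parser` is used so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

html = """<table id="outer">
<tr>
<td>Top level table cell</td>
<td>
    <table id="inner">
    <tr><td>Nested table cell</td></tr>
    </table>
</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")

# Tables that contain no other table (the same filter as in the answer)
innermost = [t for t in tables if not t.find_all("table")]

# Tables that are not nested inside any other table
outermost = [t for t in tables if t.find_parent("table") is None]

print([t["id"] for t in innermost])  # ['inner']
print([t["id"] for t in outermost])  # ['outer']
```

Which of the two you want depends on the page: layout tables usually wrap the data, so the innermost ones tend to hold the actual values.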

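Alternatively, if you want to extract the contents of all tables (including tables that nest other tables), you can extract only the top-level tr rows and their th/td cells. To do that, turn off recursion when calling the find_all method: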

soup = BeautifulSoup(html, 'lxml')
tables = soup.find_all('table')
for cnt, my_table in enumerate(tables, start=1):
    print('=============== TABLE {} ==============='.format(cnt))
    # recursive=False restricts the search to direct children
    rows = my_table.find_all('tr', recursive=False)                  # <-- HERE
    for row in rows:
        cells = row.find_all(['th', 'td'], recursive=False)          # <-- HERE
        for cell in cells:
            # DO SOMETHING
            # cell.string is None when the cell has more than one child
            if cell.string:
                print(cell.string)

Output:

=============== TABLE 1 ===============
Top level table cell
=============== TABLE 2 ===============
Nested table cell
...another nested cell

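Note that the cell holding the nested table prints nothing, because `.string` is `None` for a tag with several children. If such a cell also carries text of its own, one possible way to recover it is to keep only the strings whose nearest enclosing table is the table being read. The helper `cell_text` and the sample markup below are hypothetical, and the sketch uses `html.parser`; with a parser that inserts `tbody` elements, the non-recursive `tr` lookup would need adjusting.

```python
from bs4 import BeautifulSoup

html = """<table>
<tr>
<td>Top level <b>table</b> cell</td>
<td>Outer text<table><tr><td>Nested table cell</td></tr></table></td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("table")


def cell_text(cell, table):
    """Text of a cell, ignoring strings that belong to nested tables."""
    parts = [
        s.strip()
        for s in cell.find_all(string=True)
        # Keep a string only if its closest table ancestor is this table
        if s.find_parent("table") is table
    ]
    return " ".join(p for p in parts if p)


texts = [
    cell_text(cell, outer)
    for row in outer.find_all("tr", recursive=False)
    for cell in row.find_all(["th", "td"], recursive=False)
]
print(texts)  # ['Top level table cell', 'Outer text']
```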
