Using BeautifulSoup to extract the text of specific TD table elements?

Question


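I'm trying to extract IP addresses from an auto-generated HTML table using the BeautifulSoup library, and I'm having a bit of trouble.

The HTML is structured as follows: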

<html>
<body>
    <table class="mainTable">
    <thead>
        <tr>
            <th>IP</th>
            <th>Country</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><a href="hello.html">127.0.0.1</a></td>
            <td><img src="uk.gif" /><a href="uk.com">uk</a></td>
        </tr>
        <tr>
            <td><a href="hello.html">192.168.0.1</a></td>
            <td><img src="uk.gif" /><a href="us.com">us</a></td>
        </tr>
        <tr>
            <td><a href="hello.html">255.255.255.0</a></td>
            <td><img src="uk.gif" /><a href="br.com">br</a></td>
        </tr>
    </tbody>
</table>


The short snippet below extracts the text from both td cells of each row, but I only need the IP data, not the IP and country data:

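from bs4 import BeautifulSoup

# Parse the saved page with an explicit parser
with open("data.htm") as f:
    soup = BeautifulSoup(f, "html.parser")

table = soup.find('table', {'class': 'mainTable'})
# find_all("a") matches every link in the table, IP and country alike
for row in table.find_all("a"):
    print(row.text)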


This outputs:

127.0.0.1
uk
192.168.0.1
us
255.255.255.0
br


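What I need is the text of the IP table.tbody.tr.td.a elements, not the country table.tbody.tr.td.img.a elements.

Could any experienced BeautifulSoup users shed some light on how to select and extract these?

Thanks.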

Answer 1

m.w*_*ski 1

Search only for the first <td> in each row of the tbody:

import bs4

# html should contain the page content:
[row.find('td').get_text() for row in bs4.BeautifulSoup(html, 'html.parser').find('tbody').find_all('tr')]


Or perhaps more readably:

# Collect the body rows first, then take the first <td> of each one
rows = bs4.BeautifulSoup(html, 'html.parser').find('tbody').find_all('tr')
iplist = [row.find('td').get_text() for row in rows]

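For reference, here is a minimal self-contained sketch of the same idea, run against a copy of the table from the question. The names `SAMPLE_HTML` and `extract_first_column` are hypothetical and only for illustration. The sketch assumes the anchors in the markup are closed properly and that the page really contains a `<tbody>` element, since `html.parser` does not add one.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the table in the question (assumption: anchors closed properly).
SAMPLE_HTML = """
<table class="mainTable">
  <thead>
    <tr><th>IP</th><th>Country</th></tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="hello.html">127.0.0.1</a></td>
      <td><img src="uk.gif" /><a href="uk.com">uk</a></td>
    </tr>
    <tr>
      <td><a href="hello.html">192.168.0.1</a></td>
      <td><img src="uk.gif" /><a href="us.com">us</a></td>
    </tr>
    <tr>
      <td><a href="hello.html">255.255.255.0</a></td>
      <td><img src="uk.gif" /><a href="br.com">br</a></td>
    </tr>
  </tbody>
</table>
"""


def extract_first_column(html):
    """Return the stripped text of the first <td> in every <tbody> row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "mainTable"})
    values = []
    for row in table.find("tbody").find_all("tr"):
        first_cell = row.find("td")  # first <td> in document order
        if first_cell is not None:   # skip rows that have no data cells
            values.append(first_cell.get_text(strip=True))
    return values


ip_list = extract_first_column(SAMPLE_HTML)
print(ip_list)  # ['127.0.0.1', '192.168.0.1', '255.255.255.0']
```

Because `row.find('td')` returns only the first matching cell, the country column is never visited, so its links do not need to be filtered out afterwards.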
