sia*_*mii 0 python regex parsing html-parsing
I have HTML text like this:
<tr>
<td><strong>Turnover</strong></td>
<td width="20%" class="currency">£348,191</td>
<td width="20%" class="currency">£856,723</td>
<td width="20%" class="currency">£482,177</td>
</tr>
<tr>
<td> Cost of sales</td>
<td width="20%" class="currency">£275,708</td>
<td width="20%" class="currency">£671,345</td>
<td width="20%" class="currency">£357,587</td>
</tr>
<tr>
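There is a lot of other HTML before and after it. I want to parse these numbers. The number of td columns can vary, so I want to parse all of them. In this case there are three columns, so the result I am looking for is: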
[348191, 856723, 482177]
Ideally, I would like to parse the Turnover data and the Cost of sales data into separate variables.
You can use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """ <tr>
... <td><strong>Turnover</strong></td>
... <td width="20%" class="currency">£348,191</td>
... <td width="20%" class="currency">£856,723</td>
... <td width="20%" class="currency">£482,177</td>
... </tr>
... <tr>
... <td> Cost of sales</td>
... <td width="20%" class="currency">£275,708</td>
... <td width="20%" class="currency">£671,345</td>
... <td width="20%" class="currency">£357,587</td>
... </tr>"""
>>> soup = BS(html, "html.parser")  # name the parser explicitly
>>> for i in soup.find_all('tr'):
...     if i.find('td').text == "Turnover":
...         for x in i.find_all('td', {'class': 'currency'}):
...             print(x.text)
...
£348,191
£856,723
£482,177
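First, we convert the HTML into a bs4 object that we can navigate easily. No prizes for guessing what find_all does: it finds all of the <tr>s.
We loop over each tr, and if the text of its first <td> is Turnover, we go through the rest of its <td>s.
We loop over each td with class="currency" and print its contents.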
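The loop above prints strings such as £348,191, while the question asks for lists of integers, one per row. A minimal sketch of one way to get there, assuming every value is a whole number with an optional currency symbol and thousands separators; the helper name parse_currency_rows and the dictionary layout are hypothetical, not part of the original answer:

```python
import re
from bs4 import BeautifulSoup

html = """<tr>
<td><strong>Turnover</strong></td>
<td width="20%" class="currency">£348,191</td>
<td width="20%" class="currency">£856,723</td>
<td width="20%" class="currency">£482,177</td>
</tr>
<tr>
<td> Cost of sales</td>
<td width="20%" class="currency">£275,708</td>
<td width="20%" class="currency">£671,345</td>
<td width="20%" class="currency">£357,587</td>
</tr>"""


def parse_currency_rows(html):
    """Map each row label (hypothetical layout) to a list of ints."""
    soup = BeautifulSoup(html, "html.parser")
    rows = {}
    for tr in soup.find_all("tr"):
        first = tr.find("td")
        if first is None:
            continue  # row without cells, nothing to label
        label = first.get_text(strip=True)
        values = []
        for td in tr.find_all("td", class_="currency"):
            # Assumption: whole-number amounts, so dropping every
            # non-digit character (symbol, commas) is safe.
            digits = re.sub(r"\D", "", td.get_text())
            if digits:
                values.append(int(digits))
        rows[label] = values
    return rows


rows = parse_currency_rows(html)
turnover = rows["Turnover"]
cost_of_sales = rows["Cost of sales"]
print(turnover)
print(cost_of_sales)
```

Because the cells are collected with find_all, the number of currency columns can vary from row to row. Amounts with decimals or negative values would need a different cleanup step than stripping non-digits.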