How to extract text from HTML with beautifulsoup4?

Question

The HTML looks like this:

<td class='Thistd'><a ><img /></a>Here is some text.</td>

I only want the string inside the <td>. I don't need the <a>...</a>. How can I do that?

My code:

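from bs4 import BeautifulSoup

html = """<td class='Thistd'><a><img /></a>Here is some text.</td>"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td', {'class': 'Thistd'})
for td in tds:
    print(td)  # prints the whole tag, markup included
    print('=============')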

What I get is <td class='Thistd'><a ><img /></a>Here is some text.</td>

But I only need Here is some text.

Answer 1

The*_*nse 5

Code:

from bs4 import BeautifulSoup

html = """<td class='Thistd'><a ><img /></a>Here is some text.</td>"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td', {'class': 'Thistd'})
for td in tds:
    print(td.text)  # the only change you need to make
    print('=============')

Output:

Here is some text.
=============

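Note:

.text returns only the text contained in the given bs4 object, which here is the td tag, and leaves out the markup. For more information, have a look at the official site.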
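One caveat worth illustrating: .text (like get_text()) also collects text nested inside child tags. In the question the <a> holds only an <img>, so nothing extra shows up. The sketch below uses a hypothetical variant of the HTML in which the <a> contains the words "link label" (an assumption, not part of the original question) and shows one possible way to keep only the strings that sit directly inside the <td>.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical variant of the question's HTML: the <a> now contains text
html = "<td class='Thistd'><a>link label<img /></a>Here is some text.</td>"

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="Thistd")

# get_text() joins every string found anywhere below the tag
all_text = td.get_text()

# Keep only the strings that are direct children of <td>,
# skipping anything nested inside <a>
direct_text = "".join(
    child for child in td.children if isinstance(child, NavigableString)
).strip()

print(all_text)
print(direct_text)
```

If the child tags never contain text, as in the original question, td.text alone is enough and this extra filtering is unnecessary.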
