How can I remove all the different script tags in BeautifulSoup?

Question

Spa*_*ine 5 html python beautifulsoup html-parsing

I am scraping a table from a web page and want to rebuild the table with all the markup tags removed. Here is the source code.

import requests
from bs4 import BeautifulSoup

response = requests.get(url)  # url: address of the page holding the table
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find('table')

for row in table.find_all('tr'):
    for col in row.find_all('td'):
        # remove all different script tags (attempts so far)
        #col.replace_with('')
        #col.decompose()
        #col.extract()
        col = col.contents

How can I remove all the different tags? Take the following cell as an example, which contains the tags a, br and td.

<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>

My expected result is:

Signal et Communication
Ingénierie Réseaux et Télécommunications

Answer 1

ale*_*cxe 5

You are asking about get_text():

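If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string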

td = soup.find("td")
td.get_text()
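
Applied to the loop from the question, a minimal sketch might look like the following. The inline HTML is only a stand-in for the downloaded page, which the question does not show, and the `separator` and `strip` arguments of get_text() are optional extras that the answer itself does not use. `html.parser` is picked here simply to avoid depending on an extra parser library.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text; the real page is not shown in the question.
html = """
<table>
<tr><td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = []
for row in table.find_all("tr"):
    # get_text() drops the a/br markup and keeps only the text nodes.
    # strip=True trims each piece and skips whitespace-only pieces;
    # separator is placed between the remaining pieces.
    rows.append([col.get_text(separator="\n", strip=True)
                 for col in row.find_all("td")])

for cells in rows:
    for cell in cells:
        print(cell)
```

Collecting the cell text into `rows` rather than reassigning `col` keeps the parsed tree untouched, so the table can still be inspected afterwards.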

Note that .string would return None in this case, because td has multiple children:

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> soup = BeautifulSoup("""
... <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
... <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
... </td>
... """, "html.parser")
>>> 
>>> td = soup.td
>>> print(td.string)
None
>>> print(td.get_text())
Signal et Communication
Ingénierie Réseaux et Télécommunications
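
If each line is wanted as a separate item rather than as one joined string, the `.stripped_strings` generator is another option. This is a sketch alongside the answer, not part of it; the second, single-child cell is made up here purely to contrast with the `.string` result shown in the demo.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<td><a href="http://www.irit.fr/SC">Signal et Communication</a>\n'
    '<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>\n'
    '</td>',
    "html.parser",
)

# .stripped_strings yields each text node with surrounding whitespace
# removed, skipping nodes that contain only whitespace.
lines = list(soup.td.stripped_strings)
print(lines)

# Hypothetical cell with a single text child: here .string is defined,
# unlike the multi-child td above where it is None.
single = BeautifulSoup("<td>Signal et Communication</td>", "html.parser")
print(single.td.string)
```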
