How to add space around removed tags in BeautifulSoup

Question

Rea*_*ith 5 html python beautifulsoup html-parsing

from BeautifulSoup import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''


soup = BeautifulSoup(html)
all_poems = soup.findAll("div", {"class": "thisText"})
for poems in all_poems:
    print(poems.text)

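I have this sample code, but I can't find how to add spaces around the removed tags, so that the text inside the <a href...> stays readable once the tags are stripped and doesn't display like this:

Poem The RavenOnce upon a midnight dreary, while I pondered, weak and weary...

In the greenest of our valleys By good angels tenanted..., part ofThe Haunted Palace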

Answer 1

moh*_*mad 11

get_text() in beautifulsoup4 has an optional argument called separator. You can use it as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
text = soup.get_text(separator=' ')
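
To make the effect of separator concrete, here is a minimal sketch applied per div, as in the question. The HTML is a shortened version of the question's sample, the href values are placeholders, and the helper name poem_texts is hypothetical. Passing strip=True is an addition to the answer above: it trims each text node before joining, which avoids doubled spaces and leading newlines.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's sample; href values are placeholders.
html = """<div class="thisText">
Poem <a href="#raven">The Raven</a>Once upon a midnight dreary... </div>
<div class="thisText">
In the greenest of our valleys..., part of<a href="#palace">The Haunted Palace</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")


def poem_texts(soup):
    """Return the text of each div.thisText, with a space between text nodes."""
    return [
        # strip=True trims every text node and skips the ones left empty
        div.get_text(separator=" ", strip=True)
        for div in soup.find_all("div", class_="thisText")
    ]


for line in poem_texts(soup):
    print(line)
```

With this input the link text is separated from its neighbours: "The Raven Once upon" instead of "The RavenOnce upon", and "part of The Haunted Palace" instead of "part ofThe Haunted Palace".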

Answer 2

ale*_*cxe 2

One option is to find all the text nodes and join them with a space:

" ".join(item.strip() for item in poems.find_all(text=True))
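
A self-contained sketch of the same idea, assuming beautifulsoup4. The helper name join_text_nodes and the shortened HTML are hypothetical. It uses string=True, the newer spelling of the text=True filter in bs4, and additionally drops whitespace-only nodes, since joining those would leave stray spaces in the result.

```python
from bs4 import BeautifulSoup

# Shortened sample with a placeholder href; note the whitespace-only node before </div>.
html = '<div class="thisText">part of<a href="#palace">The Haunted Palace</a>\n</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="thisText")


def join_text_nodes(tag):
    """Join the tag's non-empty text nodes with single spaces."""
    pieces = (node.strip() for node in tag.find_all(string=True))
    return " ".join(piece for piece in pieces if piece)


print(join_text_nodes(div))
```

The built-in stripped_strings generator yields the same trimmed, non-empty pieces, so " ".join(div.stripped_strings) is a shorter equivalent for this input.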

Also, the beautifulsoup3 package you are using is outdated and no longer maintained. Upgrade to beautifulsoup4:

pip install beautifulsoup4

and replace:

from BeautifulSoup import BeautifulSoup

with:

from bs4 import BeautifulSoup
