Rea*_*ith 5 html python beautifulsoup html-parsing
from BeautifulSoup import BeautifulSoup
html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>
<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''
soup = BeautifulSoup(html)
all_poems = soup.findAll("div", {"class": "thisText"})
for poems in all_poems:
    print(poems.text)
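I have this sample code, but I can't find how to add spaces around the removed tags, so that the text inside <a href...> stays readable when it is formatted and doesn't come out like this: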
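Poem The RavenOnce upon a midnight dreary, while I pondered, weak and weary...
In the greenest of our valleys By good angels tenanted..., part ofThe Haunted Palace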
moh*_*mad 11
get_text() in beautifulsoup4 accepts an optional argument named separator. You can use it as follows:
soup = BeautifulSoup(html)
text = soup.get_text(separator=' ')
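To make the separator approach concrete, here is a minimal sketch applied to the HTML from the question, assuming beautifulsoup4 is installed. The explicit "html.parser" argument and strip=True are additions of this sketch, not part of the answer: strip=True trims each text fragment before joining, which avoids doubled spaces and stray newlines.

```python
from bs4 import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>
<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

# Naming the parser explicitly is an assumption of this sketch; it avoids
# bs4's "no parser specified" warning.
soup = BeautifulSoup(html, "html.parser")

# separator=" " puts a space between adjacent text fragments;
# strip=True trims each fragment and skips the empty ones.
texts = [
    div.get_text(separator=" ", strip=True)
    for div in soup.find_all("div", class_="thisText")
]

for text in texts:
    print(text)
```

With these settings the link text is separated from the surrounding words by a single space on both sides.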
One option is to find all the text nodes and join them with spaces:
" ".join(item.strip() for item in poems.find_all(text=True))
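The text-node approach can be sketched as a small helper. This is an illustration under assumptions: join_text_nodes is a hypothetical name, the sample markup is made up for the demo, and string=True is used in place of text=True because newer beautifulsoup4 releases prefer that spelling. The helper also drops whitespace-only nodes, which the one-liner above would keep as empty strings.

```python
from bs4 import BeautifulSoup


def join_text_nodes(tag):
    """Join the stripped text nodes under `tag` with single spaces.

    Hypothetical helper for illustration; whitespace-only nodes are skipped
    so they do not produce leading, trailing, or doubled spaces.
    """
    parts = (node.strip() for node in tag.find_all(string=True))
    return " ".join(part for part in parts if part)


# Made-up sample markup, mirroring the second div in the question.
sample = '<p>part of<a href="#">The Haunted Palace</a>\n</p>'
soup = BeautifulSoup(sample, "html.parser")

joined = join_text_nodes(soup.p)
print(joined)
```

Compared with get_text(separator=' '), this variant gives direct control over how each fragment is cleaned before joining.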
Also, the beautifulsoup3 package you are using is outdated and no longer maintained. Upgrade to beautifulsoup4:
pip install beautifulsoup4
and replace:
from BeautifulSoup import BeautifulSoup
with:
from bs4 import BeautifulSoup