Rio*_*Rio 8 python regex beautifulsoup html-parsing
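I have a bunch of HTML that I'm parsing with BeautifulSoup, and it has gone smoothly except for one small snag. I want to save the output as a single-line string, but this is what I currently get: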
<li><span class="plaincharacterwrap break">
Zazzafooky but one two three!
</span></li>
<li><span class="plaincharacterwrap break">
Zazzafooky2
</span></li>
<li><span class="plaincharacterwrap break">
Zazzafooky3
</span></li>
Ideally, I'd like:
<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li>
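There is a lot of redundant whitespace I want to get rid of, but it can't necessarily be removed with strip(), and I can't simply delete every space either, because I need to keep the text intact. How can I do this? It seems like a common enough problem that a regex would be overkill, but is that the only way?
I don't have any <pre> tags, so I can afford to be more aggressive there.
Thanks again!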
twi*_*wig 13
Old question, I know, but beautifulsoup4 has a helper for this called stripped_strings.
Try this:
description_el = about.find('p', { "class": "description" })
descriptions = list(description_el.stripped_strings)
description = "\n\n".join(descriptions) if descriptions else ""
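For a version that can be run on its own, here is a minimal sketch applying stripped_strings to markup shaped like the question's. The variable names and the choice of the built-in "html.parser" are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the name `html` is illustrative.
html = """<li><span class="plaincharacterwrap break">
Zazzafooky but one two three!
</span></li>
<li><span class="plaincharacterwrap break">
Zazzafooky2
</span></li>"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that contain only whitespace.
texts = list(soup.stripped_strings)
print(texts)
```

Note that stripped_strings yields only the text, not the surrounding tags, so it fits best when the text itself is what needs to be stored rather than the markup.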
And*_*ark 11
Without a regex, you could do the following:
>>> html = """ <li><span class="plaincharacterwrap break">
... Zazzafooky but one two three!
... </span></li>
... <li><span class="plaincharacterwrap break">
... Zazzafooky2
... </span></li>
... <li><span class="plaincharacterwrap break">
... Zazzafooky3
... </span></li>
... """
>>> html = "".join(line.strip() for line in html.split("\n"))
>>> html
'<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li><li><span class="plaincharacterwrap break">Zazzafooky3</span></li>'
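The line-by-line approach can be wrapped in a small function. The sketch below also shows a regex-based alternative for comparison, which removes whitespace only where it touches a tag. Both function names are hypothetical, and the second function is an added variation rather than part of the answer above.

```python
import re

def collapse_lines(html):
    """Strip every line and join the results with no separator."""
    return "".join(line.strip() for line in html.splitlines())

def collapse_around_tags(html):
    """Remove whitespace only where it is adjacent to a tag (assumed variant)."""
    html = re.sub(r">\s+", ">", html)   # whitespace right after a tag
    html = re.sub(r"\s+<", "<", html)   # whitespace right before a tag
    return html.strip()

sample = (
    '<li><span class="plaincharacterwrap break">\n'
    '  Zazzafooky but one two three!\n'
    '</span></li>\n'
    '<li><span class="plaincharacterwrap break">\n'
    '  Zazzafooky2\n'
    '</span></li>\n'
)

print(collapse_lines(sample))
print(collapse_around_tags(sample))
```

The two functions give the same result on this sample, but they differ when text wraps across lines inside an element: collapse_lines("<p>one\ntwo</p>") glues the words together as "onetwo", while collapse_around_tags leaves the inner newline alone. Either approach also drops whitespace between adjacent inline elements, which is safe for list items like these but may change how running text renders.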