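After running BeautifulSoup's prettify, I want to remove the line breaks and indentation around span elements, and perhaps other inline tags as well.
For example, I currently have something like this: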
>>> import bs4
>>> html = "<div><p>I don't want this <span>span element</span> on its one line.</p></div>"
>>> soup = bs4.BeautifulSoup(html, "html.parser")
>>> soup.prettify()
"<div>\n <p>\n  I don't want this\n  <span>\n   span element\n  </span>\n  on its one line.\n </p>\n</div>"
>>> print(soup.prettify())
<div>
 <p>
  I don't want this
  <span>
   span element
  </span>
  on its one line.
 </p>
</div>
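What regular expression can I use to remove the indentation whitespace and line breaks around the span tags, so that I end up with the following result:
<div>
 <p>
  I don't want this <span>span element</span> on its one line.
 </p>
</div>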
I know this is old, and Mohamed has already provided an excellent answer. Still, I would like to add to it.
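While it works well, I found that it missed tags that had a line break either before or after the tag (but not both). I think this is due to the '+' in [ \n]+, which says to match one or more \n or space characters (so a tag that does not have one or more spaces or newlines on both sides will not match). I used his method, but changed it to the following: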
import re

# remove whitespace before and after the opening <span> tag
html = re.sub(r'\s*<span>\s*', '<span>', html)
# remove whitespace before and after the closing </span> tag
html = re.sub(r'\s*</span>\s*', '</span>', html)
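\s matches any whitespace character (space, tab, newline, and so on), and * means match 0 or more of them (so the pattern still matches when there is whitespace on only one side of the tag).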
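To make the difference concrete, here is a small comparison. It assumes the earlier pattern had the shape [ \n]+<span>[ \n]+, which is my guess from the [ \n]+ fragment quoted above, and the sample string is made up for illustration:

```python
import re

# Hypothetical input: whitespace follows <span>, but nothing precedes it.
sample = "text<span>\n   inner\n  </span>\n  tail"

# Assumed shape of the earlier pattern: needs 1+ spaces/newlines on BOTH sides.
strict = re.sub(r'[ \n]+<span>[ \n]+', '<span>', sample)

# \s* accepts zero whitespace on either side, so the tag is still matched.
loose = re.sub(r'\s*<span>\s*', '<span>', sample)

print(strict == sample)  # True: the strict pattern found nothing to replace
print(repr(loose))
```

With the strict pattern the string comes back unchanged, while the \s* version removes the whitespace after the opening tag.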
Also, if any of your elements have attributes (e.g. <span class="myclass">), you need to do a little extra:
# [^>]* matches 0 or more of any character OTHER than >, so the pattern
# matches <span>, <span class="a">, <span style="display:hidden;">, etc.
# The parentheses around it store the match in a capture group
# so it can be inserted into the replacement.
tag_regex = re.compile(r'\s*<span([^>]*)>\s*')
# \1 inserts what the capture group matched (the [^>]*, i.e. our
# attributes, if any) into the replacement text.
html = tag_regex.sub(r'<span\1>', html)
# remove whitespace before and after the closing </span> tag
html = re.sub(r'\s*</span>\s*', '</span>', html)
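As a quick check of the capture group, here is a self-contained run on a made-up fragment (the sample string is hypothetical, not taken from the question):

```python
import re

# Hypothetical prettified fragment with an attribute on the span.
sample = '<p>\n  <span class="a">\n   hi\n  </span>\n</p>'

tag_regex = re.compile(r'\s*<span([^>]*)>\s*')

# Group 1 holds ' class="a"', so the attribute survives the substitution.
opened = tag_regex.sub(r'<span\1>', sample)
closed = re.sub(r'\s*</span>\s*', '</span>', opened)

print(closed)
```

The attribute is carried through and the whitespace on both sides of each tag is gone.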
Putting all of this into Mohamed's generic function, which works for any tag, we get:
import re
from bs4 import BeautifulSoup

def prettify_output(html, tag):
    # strip whitespace around the opening tag, keeping any attributes (\1)
    reg_tag = re.compile(rf'\s*<{tag}([^>]*)>\s*')
    html = reg_tag.sub(rf'<{tag}\1>', html)
    # strip whitespace around the closing tag
    html = re.sub(rf'\s*</{tag}>\s*', f'</{tag}>', html)
    return html

html = BeautifulSoup("<body><div><div><span class='a'>dont</span><span class='b'>split me!</span></div></div></body>", 'html.parser')
html = html.prettify()  # or however you call BeautifulSoup's prettify
html = prettify_output(html, 'span')
Which produces the output:
<body>
 <div>
  <div><span class="a">dont</span><span class="b">split me!</span></div>
 </div>
</body>
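One thing worth checking is what happens when the span sits inside running text, as in the question. The run below uses a hard-coded string that I am assuming matches prettify's output for the question's HTML (one space of indentation per level), so it does not need BeautifulSoup installed:

```python
import re

def prettify_output(html, tag):
    # same approach as above: strip all whitespace around the tags
    html = re.sub(rf'\s*<{tag}([^>]*)>\s*', rf'<{tag}\1>', html)
    return re.sub(rf'\s*</{tag}>\s*', f'</{tag}>', html)

# Assumed prettify() output for the HTML in the question.
pretty = ("<div>\n <p>\n  I don't want this\n  <span>\n   span element\n"
          "  </span>\n  on its one line.\n </p>\n</div>")

result = prettify_output(pretty, 'span')
print(result)
```

Because all whitespace around the tags is removed, the neighbouring words end up directly against the span. If that space matters for rendering, the replacement would have to put a single space back next to text, which this function does not attempt.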
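One improvement would be to keep the line breaks before <span *stuff*> and after </span> unless the element is nested inside or sits next to another span. That would preserve the prettified line breaks between <div> and <span class="a">, and between </span> and </div>; at the moment the function removes them, because it strips all whitespace before and after these tags.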
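A minimal sketch of one way that improvement might look. The function name prettify_keep_outer and the sample string are hypothetical, and this is only my reading of the idea, not a tested drop-in replacement. It strips whitespace just inside the element and between adjacent elements of the same tag, and leaves the outer line breaks alone:

```python
import re

def prettify_keep_outer(html, tag):
    name = re.escape(tag)  # guard against regex metacharacters in the tag name
    # strip whitespace right after the opening tag (inside the element)
    html = re.sub(rf'(<{name}\b[^>]*>)\s+', r'\1', html)
    # strip whitespace right before the closing tag (inside the element)
    html = re.sub(rf'\s+(</{name}>)', r'\1', html)
    # join siblings: a closing tag followed only by whitespace and another opening tag
    html = re.sub(rf'(</{name}>)\s+(?=<{name}\b)', r'\1', html)
    return html

# Hypothetical prettified input with two sibling spans.
pretty = ('<div>\n <span class="a">\n  dont\n </span>\n'
          ' <span class="b">\n  split me!\n </span>\n</div>')

result = prettify_keep_outer(pretty, 'span')
print(result)
```

The \b after the tag name is there so that a short name such as p does not also match <pre>; the original function does not make that distinction.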