How do I remove newlines only if they occur inside HTML tags?

Question

Sorry, another Python newbie question. I have a string:

my_string = "<p>this is some \n fun</p>And this is \n some more fun!"

I want:

my_string = "<p>this is some fun</p>And this is \n some more fun!"

In other words, how do I get rid of the "\n" only if it occurs inside an HTML tag?

I have:

my_string = re.sub('<(.*?)>(.*?)\n(.*?)</(.*?)>', 'replace with what???', my_string)

which obviously doesn't work, but I'm stuck.

Answer 1

Regular expressions are a poor fit for HTML. Don't do it. See "RegEx match open tags except XHTML self-contained tags".

Instead, use an HTML parser. Python ships with html.parser, or you can use Beautiful Soup or html5lib. All you need to do is walk the tree and remove the newlines.
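
As a rough illustration of the parser approach, here is a minimal sketch using only the standard library's html.parser. That module is event-based rather than tree-building, so instead of walking a tree the sketch counts how many elements are currently open and rewrites text only while that count is above zero. The names NewlineStripper and strip_newlines_in_tags are made up for this example. It also assumes that a newline and the whitespace around it should collapse to a single space, which is what the desired output in the question suggests. It makes no special case for pre, script or style content, and it writes end tags back in lowercase.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NewlineStripper(HTMLParser):
    """Re-emit the markup, collapsing newlines in text that sits inside an element."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.depth = 0    # number of currently open (non-void) elements
        self.parts = []   # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)
        self.parts.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit as-is, depth unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_data(self, data):
        if self.depth > 0:
            # Assumption: a newline plus surrounding whitespace becomes one space.
            data = re.sub(r"\s*\n\s*", " ", data)
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_newlines_in_tags(markup):
    """Return markup with newlines removed from text inside elements only."""
    parser = NewlineStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
print(repr(strip_newlines_in_tags(my_string)))
```

Beautiful Soup or html5lib would give you an actual tree to walk, and would probably cope better with malformed markup; the depth counter here is just the smallest thing that works for the string in the question.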