How do I convert <br> and <p> into line breaks?

TIM*_*MEX 12 html python regex xml

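Suppose I have HTML containing <p> and <br> tags, and I'm about to strip the HTML to clean the tags out. How can I turn those tags into line breaks instead?

I'm using Python's BeautifulSoup library, if that helps.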

Mik*_*ton 15

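Without more details it's hard to be sure this does exactly what you want, but it should give you the idea... it assumes your <br> tags are contained in <p> elements.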

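# Python 2 with BeautifulSoup 3 (the old `BeautifulSoup` package).
from BeautifulSoup import BeautifulSoup
import types

def replace_with_newlines(element):
    """Return the text inside `element`, with each <br> turned into a newline."""
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, types.StringTypes):
            # Text node: keep the text, dropping surrounding whitespace.
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text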

page = """<html>
<body>
<p>America,<br>
Now is the<br>time for all good men to come to the aid<br>of their country.</p>
<p>pile on taxpayer debt<br></p>
<p>Now is the<br>time for all good men to come to the aid<br>of their country.</p>
</body>
</html>
"""

soup = BeautifulSoup(page)
lines = soup.find("body")
for line in lines.findAll('p'):
    line = replace_with_newlines(line)
    print line

Running this results in...

(py26_default)[mpenning@Bucksnort ~]$ python thing.py
America,
Now is the
time for all good men to come to the aid
of their country.
pile on taxpayer debt

Now is the
time for all good men to come to the aid
of their country.
(py26_default)[mpenning@Bucksnort ~]$


Gen*_*wen 5

This is a Python 3 version of @Mike Pennington's answer (which really helped); I did a little refactoring.

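def replace_with_newlines(element):
    """Return the text inside `element`, with each <br> turned into a newline."""
    text = ''
    # .descendants is the bs4 name for recursiveChildGenerator().
    for elem in element.descendants:
        if isinstance(elem, str):
            # Text node (NavigableString subclasses str).
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text


def get_plain_text(soup):
    """Convert every <p> inside <body> to plain text, one paragraph per line."""
    plain_text = ''
    lines = soup.find("body")
    for line in lines.find_all('p'):
        line = replace_with_newlines(line)
        # End each paragraph with a newline so consecutive <p> blocks don't run together.
        plain_text += line + '\n'
    return plain_text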

To use it, just pass the BeautifulSoup object to the get_plain_text function.

from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "html.parser")  # `page` is an HTML string, as in the answer above
plain_text = get_plain_text(soup)


nao*_*oko 5

get_text seems to do what you need:

>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'

  • Not quite: get_text(separator='\n') inserts `separator` after *every* tag. So, for example, "This has <i>no</i> line breaks" becomes "This has \nno\n line breaks". Yes, it's odd. (7 upvotes)
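
One way around the problem raised in that comment is to turn only the <br> tags into newlines before extracting the text, rather than passing a separator to get_text. The following is a minimal sketch, assuming bs4 with the built-in "html.parser"; the function name `html_to_text` is made up for illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Turn each <br> into a newline and put each <p> on its own line.

    `html_to_text` is a hypothetical helper name, not a BeautifulSoup API.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Swap every <br> for a literal newline so it survives get_text().
    for br in soup.find_all("br"):
        br.replace_with("\n")
    # Without a separator, get_text() adds nothing around inline tags such as <i>.
    paragraphs = [p.get_text() for p in soup.find_all("p")]
    return "\n".join(paragraphs)


print(html_to_text("<p>This has <i>no</i> line break</p><p>One<br>two</p>"))
```

Whitespace that is already in the source (for example a newline right after a <br>) is kept as-is, so real pages may need an extra cleanup pass.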

Анд*_*рей -6

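I'm not entirely sure what you're trying to accomplish, but if you just want to remove the HTML elements, I would use a program like Notepad2 and its "Replace All" feature. I think you can insert a new line with Replace All as well. When you replace the <p> element, make sure you also remove its closing tag (</p>). Also, FYI, <br /> is the XHTML-style spelling of <br>; HTML5 accepts both, so it doesn't really matter. Python isn't my first choice, so this is a bit outside my knowledge. Sorry I can't be of more help.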
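
For comparison, the "Replace All" idea can be sketched in Python with the standard `re` module. This is only a rough illustration, assuming small, well-behaved snippets; regular expressions break down on real-world HTML (comments, scripts, attributes containing ">"), which is why the parser-based answers above are generally safer. The function name `strip_html_with_regex` is hypothetical.

```python
import html
import re


def strip_html_with_regex(markup):
    """Rough regex-based cleanup; fine for simple snippets, fragile on real pages."""
    # <br>, <br/> and <br /> each become one newline.
    text = re.sub(r"<br\s*/?>", "\n", markup, flags=re.IGNORECASE)
    # A closing </p> ends the paragraph, so it becomes a newline as well.
    text = re.sub(r"</p\s*>", "\n", text, flags=re.IGNORECASE)
    # Drop every remaining tag, including the opening <p>.
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &amp; and trim leading/trailing whitespace.
    return html.unescape(text).strip()


print(strip_html_with_regex("<p>One<br />two</p><p>Three &amp; four</p>"))
```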