How to get and replace text between certain tags

mad*_*ops 1 html python regex html-parsing

Given a string

"<p> >this line starts with an arrow <br /> this line does not </p>"

or

"<p> >this line starts with an arrow </p> <p> this line does not </p>"

how can I find the lines that start with an arrow and wrap them in a div,

so that the string becomes:

"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>"

ale*_*cxe 6

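Since it is HTML you are parsing, use the right tool for the job: an HTML parser, such as BeautifulSoup.

Use find_all() to find all text nodes that start with > and wrap() each of them in a new div tag: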

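from bs4 import BeautifulSoup

data = "<p> >this line starts with an arrow <br /> this line does not </p>"

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")

# Match text nodes whose stripped content starts with '>' and wrap each one in a new <div>
for item in soup.find_all(string=lambda s: s.strip().startswith('>')):
    item.wrap(soup.new_tag('div'))

print(soup.prettify())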

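This prints something like the following (exact indentation and the escaping of > may vary with the BeautifulSoup version and parser):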

<p>
    <div>
    >this line starts with an arrow
    </div>
    <br/>
    this line does not
</p>
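If installing BeautifulSoup is not an option, the same idea (walk the text nodes with a parser rather than a regex) can be roughly sketched with the standard library's html.parser. This is only a minimal sketch, not the approach from the answer above: the names ArrowWrapper and wrap_arrow_lines are made up for illustration, comments and doctype declarations are dropped, and a text node that begins with the entity &gt; rather than a literal > is not detected.

```python
from html.parser import HTMLParser


class ArrowWrapper(HTMLParser):
    """Hypothetical helper: re-emit HTML, wrapping '>' text nodes in <div>."""

    def __init__(self):
        # convert_charrefs=False so entities are passed through unchanged
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as it appeared in the input
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are also re-emitted verbatim
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped.startswith(">"):
            # Keep the surrounding whitespace outside the new <div>
            lead = data[: len(data) - len(data.lstrip())]
            trail = data[len(data.rstrip()):]
            data = "%s<div> %s </div>%s" % (lead, stripped, trail)
        self.out.append(data)


def wrap_arrow_lines(html):
    """Hypothetical wrapper: return html with arrow text nodes wrapped in <div>."""
    parser = ArrowWrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(wrap_arrow_lines(
    "<p> >this line starts with an arrow <br /> this line does not </p>"))
print(wrap_arrow_lines(
    "<p> >this line starts with an arrow </p> <p> this line does not </p>"))
```

Both input variants from the question go through the same function, since each arrow line is its own text node whether it is followed by a <br /> or sits in a separate <p>.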