Nem*_*XXX 1 html python regex auto-increment html-parsing
I want to add ids to HTML tags. For example, I want to change:
<p>First paragraph</p>
<p>Second paragraph</p>
<p>Third paragraph</p>
to
<p id="1">First paragraph</p>
<p id="2">Second paragraph</p>
<p id="3">Third paragraph</p>
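IIRC, this can be done with a lambda function, but I can't remember the exact syntax.
I would use an HTML parser, such as BeautifulSoup.
The idea is to iterate over all the paragraphs with enumerate(), with the index starting from 1: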
from bs4 import BeautifulSoup

data = """
<p>First paragraph</p>
<p>Second paragraph</p>
<p>Third paragraph</p>
"""

soup = BeautifulSoup(data, 'html.parser')

# Number the paragraphs in document order, starting from 1
for index, p in enumerate(soup.find_all('p'), start=1):
    p['id'] = str(index)

print(soup)
This prints:
<p id="1">First paragraph</p>
<p id="2">Second paragraph</p>
<p id="3">Third paragraph</p>
If you want to use a regular expression, a quick-and-dirty solution is to use a global variable, like this:
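import re

i = 0

def replace(match):
    global i
    i += 1  # one more matched tag
    return '<p id="{0}">'.format(i)

# pattern and your_string are placeholders for your own regex and input
re.sub(pattern, replace, your_string)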
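To try the global-counter idea end to end, here is a minimal sketch. It assumes the pattern is the literal opening tag `<p>` with no attributes, which is my assumption and not something the snippet above specifies; the names `html` and `result` are also just illustrative.

```python
import re

html = """<p>First paragraph</p>
<p>Second paragraph</p>
<p>Third paragraph</p>"""

# Assumed pattern: matches only bare <p> opening tags without attributes.
pattern = re.compile(r'<p>')

i = 0

def replace(match):
    # Bump the module-level counter once per matched tag.
    global i
    i += 1
    return '<p id="{0}">'.format(i)

result = pattern.sub(replace, html)
print(result)
```

Note that `i` keeps its value after the call, so it would need to be reset to 0 before numbering a second document.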
Alternatively, you can create a custom class that "pretends" to be a function by defining __call__, and keep the counter as a field:
class Replace(object):
    def __init__(self):
        self.counter = 0

    def __call__(self, match):
        self.counter += 1
        return '<p id="{0}">'.format(self.counter)

replace = Replace()
re.sub(pattern, replace, your_string)
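Since the question asks about a lambda, the same counter can probably also be held in an `itertools.count` iterator and advanced from inside a lambda. This is only a sketch: the helper name `add_ids` is hypothetical, and the pattern again assumes bare `<p>` tags with no existing attributes.

```python
import itertools
import re

def add_ids(html):
    # A fresh counter per call, so numbering restarts at 1 for each document.
    counter = itertools.count(1)
    # Assumed pattern: bare <p> opening tags only.
    return re.sub(r'<p>',
                  lambda match: '<p id="{0}">'.format(next(counter)),
                  html)

numbered = add_ids("<p>First paragraph</p>\n<p>Second paragraph</p>")
print(numbered)
```

Keeping the counter local to the function avoids the shared state of the global-variable version.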