I have a lot of HTML text, for example:
text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub in this text here and another one <sub> here /sub> .'
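Sometimes HTML tags such as <sub> and </sub> are missing one of their angle brackets (< or >). This causes problems later in my code. My question is: how can I detect those missing brackets intelligently and repair them?
The correct text would be: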
text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub> in this text here and another one <sub> here </sub> .'
Of course, I could hard-code every possible bracket configuration, but that would take a long time, because my text contains many more errors of this kind.
text = re.sub( r'</sub ', r'</sub>', text)
text = re.sub( r' /sub>', r'</sub>', text)
...and code like the above might add a second bracket to tags that are already correct.
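Nice question! Here is a solution that does not hard-code the word sub and works on arbitrary tags, as long as only one bracket is missing per tag and the HTML tags carry no attributes (otherwise, how would we know where the tag is supposed to close? We could rely on the attr="" format, but that gets dicey). Also, tags do not need to be space-delimited as in your example, which is not typical in HTML.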
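import re

def repair(text, backwards=False):
    # Forward pass: every "<" must be closed by ">".
    left_bracket, right_bracket = "<", ">"

    # Backward pass (on the reversed string): every ">" must be closed by "<".
    if backwards:
        left_bracket, right_bracket = ">", "<"

    i = 0

    while i < len(text):
        if text[i] == left_bracket:
            j = i + 1

            # Walk over the tag name (word characters and "/").
            while j < len(text) and re.match(r"[/\w]", text[j]):
                j += 1

                # In the reversed string, "/" marks the start of a closing tag.
                if backwards and text[j-1] == "/":
                    break

            # Insert the companion bracket if it is missing.
            if j >= len(text) or text[j] != right_bracket:
                text = text[:j] + right_bracket + text[j:]

            i = j

        i += 1

    return text

def repair_tags(html):
    # Fix missing "<" on the reversed string first, then missing ">".
    return repair(repair(html[::-1], True)[::-1])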
if __name__ == "__main__":
    original = '''<li>
<a>
About Us
<span>
Learn more about Stack Overflow the company
</span>
</a>
</li>
<li>
<a>
Business
<span>
Learn more about hiring developers or posting ads with us
</span>
</a>
</li>'''
    corrupted = '''li>
<a
About Us
span>
Learn more about Stack Overflow the company
</span
</a
/li>
<li
<a
Business
span>
Learn more about hiring developers or posting ads with us
/span>
</a
</li'''
    print(repair_tags(corrupted))
    print("repaired matches original?", repair_tags(corrupted) == original)
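Explanation: walk over the string looking for a bracket character. When one is found, walk forward until hitting the end of the string or a non-word character. If the search reached the end of the string, or the current non-word character is not the correct companion bracket, put one in place.
Then do the same thing on the reversed string, swapping the target brackets and adjusting slightly so that the scan stops as soon as it finds the / of a closing tag.
Time complexity is not great because of the repeated string building. There is undoubtedly a simple regex for this, so take this as a proof of concept.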
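For comparison, here is one way such a regex version might look. This is only a sketch under the same assumptions as above (tags without attributes, at most one bracket missing per tag). The name repair_tags_regex is hypothetical and is not part of the code above. Each re.sub makes a single pass over the text, so it also avoids rebuilding the string on every insertion.

```python
import re

# Hypothetical helper name, introduced only for this sketch.
def repair_tags_regex(html):
    # "<tag" or "</tag" that is not followed by ">" (or by more name
    # characters): append the missing ">".
    html = re.sub(r"(</?\w+)(?![\w>])", r"\1>", html)
    # "tag>" or "/tag>" that is not preceded by "<", "/" or a name
    # character: prepend the missing "<".
    html = re.sub(r"(?<![\w</])(/?\w+>)", r"<\1", html)
    return html

text = ("Hello, how <sub> are </sub> you ? There is a <sub> small error "
        "</sub in this text here and another one <sub> here /sub> .")
print(repair_tags_regex(text))
```

Like the loop-based version, this treats any run of word characters that touches a bracket as a tag name, so plain text such as a>b would also be "repaired". Tags that are already well formed are left unchanged.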