Fixing HTML tag brackets with Python

Question

I have a lot of HTML text, for example:

text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub  in this text here and another one <sub> here /sub> .'

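Sometimes HTML tags such as <sub> and </sub> are missing one of their angle brackets (< or >). This causes problems later in my code. My question is: how can I intelligently detect the missing brackets and repair them?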

The correct text would be:

text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub>  in this text here and another one <sub> here </sub> .'

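Of course, I could hard-code every possible bracket configuration, but that would take a long time, because my text contains many more errors of this kind.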

import re

text = re.sub(r'</sub ', r'</sub>', text)
text = re.sub(r' /sub>', r'</sub>', text)

...and the code above might add an extra bracket to examples that are already correct.

Answer 1

ggo*_*len 1

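Good question! Here is a solution that does not hard-code the word sub and works for arbitrary tags, as long as only one bracket is missing and the HTML tags do not contain any attributes (otherwise, how would we know where the tag should close? We could rely on the attr="" format, but that gets hairy). Also, the tags do not need to be separated by spaces as in your example, which is unusual in HTML.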

Code

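import re

def repair(text, backwards=False):
    """Add the missing companion bracket after each bracket that opens a tag.

    With backwards=True the text is expected to be reversed, so the roles
    of "<" and ">" are swapped.
    """
    left_bracket, right_bracket = "<", ">"

    if backwards:
        left_bracket, right_bracket = ">", "<"

    i = 0

    while i < len(text):
        if text[i] == left_bracket:
            j = i + 1

            # Walk over the tag name: word characters and "/".
            while j < len(text) and re.match(r"[/\w]", text[j]):
                j += 1

                # In reversed text, "/" is the last character of a closing tag.
                if backwards and text[j-1] == "/":
                    break

            # Insert the companion bracket if the tag name is not followed by one.
            if j >= len(text) or text[j] != right_bracket:
                text = text[:j] + right_bracket + text[j:]

            i = j

        i += 1

    return text

def repair_tags(html):
    # Add missing "<" on the reversed string, then missing ">" going forward.
    return repair(repair(html[::-1], True)[::-1])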

Test

if __name__ == "__main__":
    original = '''<li>
    <a>
        About Us
        <span>
            Learn more about Stack Overflow the company
        </span>
    </a>
</li>
<li>
    <a>
        Business
        <span>
            Learn more about hiring developers or posting ads with us
        </span>
    </a>
</li>'''
    corrupted = '''li>
    <a
        About Us
        span>
            Learn more about Stack Overflow the company
        </span
    </a
/li>
<li
    <a
        Business
        span>
            Learn more about hiring developers or posting ads with us
        /span>
    </a
</li'''

    print(repair_tags(corrupted))
    print("repaired matches original?", repair_tags(corrupted) == original)

Output

The first print shows the repaired markup, which is identical to original. The second prints:

repaired matches original? True

How it works

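Walk through the string looking for a bracket character. When one is found, step forward until reaching the end of the string or a character that is neither a word character nor /. If the search reaches the end of the string, or the character it stopped on is not the correct companion bracket, insert a companion bracket there.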

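Then do the same on the reversed string, swapping the target brackets and adjusting slightly so that the walk breaks on / when looking for the position of a closing tag.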
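The two-pass walk described above can also be sketched with the output collected in a list, so the string is not rebuilt on every insertion. This is only an illustrative variant under the same assumptions as the answer (one bracket missing per tag, no attributes). The names add_companion and repair_tags_linear are hypothetical and do not appear in the original code.

```python
import re

# Characters that may appear inside a bare tag name, as in the answer's code.
NAME_CHAR = re.compile(r"[/\w]")

def add_companion(text, left="<", right=">", stop_after_slash=False):
    """One pass: after each `left` bracket, copy the tag name and append
    `right` if the source does not already have it there."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        out.append(ch)
        i += 1
        if ch != left:
            continue
        # Copy the tag name (word characters and "/").
        while i < n and NAME_CHAR.match(text[i]):
            out.append(text[i])
            i += 1
            # In reversed text, "/" ends a closing tag's name.
            if stop_after_slash and text[i - 1] == "/":
                break
        # Emit the companion bracket when it is missing.
        if i >= n or text[i] != right:
            out.append(right)
    return "".join(out)

def repair_tags_linear(html):
    # Pass 1 on the reversed text adds missing "<"; pass 2 adds missing ">".
    fixed_reversed = add_companion(html[::-1], left=">", right="<",
                                   stop_after_slash=True)
    return add_companion(fixed_reversed[::-1])

text = ('Hello, how <sub> are </sub> you ? There is a <sub> small error </sub  '
        'in this text here and another one <sub> here /sub> .')
print(repair_tags_linear(text))
```

Each character is copied once per pass, so the work grows linearly with the length of the input.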

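Because every insertion rebuilds the string, the time complexity is not great (quadratic in the worst case). There is no doubt a simple regex for this, so treat it as a proof of concept.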
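As a rough idea of what such a regex could look like, here is a sketch with two substitutions, one per missing bracket. It is not the answer's method, and it carries the same assumptions: tags have no attributes and at most one bracket is missing. The function name repair_tags_regex and both patterns are illustrative only.

```python
import re

# A tag name followed by ">" but not preceded by "<" (or by part of a tag).
MISSING_OPEN = re.compile(r"(?<![<\w/])(/?\w+>)")
# "<" or "</" plus a tag name that is not followed by ">".
MISSING_CLOSE = re.compile(r"(</?\w+)(?![\w>])")

def repair_tags_regex(html):
    html = MISSING_OPEN.sub(r"<\1", html)    # "/sub>" becomes "</sub>"
    html = MISSING_CLOSE.sub(r"\1>", html)   # "</sub" becomes "</sub>"
    return html

text = ('Hello, how <sub> are </sub> you ? There is a <sub> small error </sub  '
        'in this text here and another one <sub> here /sub> .')
print(repair_tags_regex(text))
```

Like the loop-based version, this treats any word directly followed by > as a tag, so ordinary text such as x>y would be altered as well.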

Archived:	6 years, 10 months ago
Views:	81
Last activity:	6 years, 10 months ago