Fixing HTML tag brackets with Python

hen*_*nry 5 python string

I have a lot of HTML text, for example

text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub  in this text here and another one <sub> here /sub> .'

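Sometimes the HTML tags, such as <sub> and </sub>, are missing their < or > bracket. This causes trouble later in the code. My question: how can I intelligently detect those missing brackets and fix them?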

The correct text would be:

text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub>  in this text here and another one <sub> here </sub> .'

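Of course, I could hard-code every possible bracket configuration, but that would take a long time, because my text contains many more errors like these.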

import re

text = re.sub(r'</sub ', r'</sub>', text)
text = re.sub(r' /sub>', r'</sub>', text)

...and the code above might also add an extra bracket to tags that are already correct.

ggo*_*len 1

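Good question! Here is a solution that doesn't hard-code the word sub and works on arbitrary tags, as long as only one bracket is missing per tag and the HTML tags don't contain any attributes (otherwise, how would we know where the tag should close? We could rely on the attr="" format, but that gets risky). Also, the tags don't need to be space-delimited as in your example, which isn't typical in HTML.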


Code

import re

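def repair(text, backwards=False):
    # Forward pass: find each "<" and make sure a ">" follows the tag name.
    left_bracket, right_bracket = "<", ">"

    # Backward pass (run on a reversed string): find each ">" and make sure a "<" follows.
    if backwards:
        left_bracket, right_bracket = ">", "<"

    i = 0

    while i < len(text):
        if text[i] == left_bracket:
            j = i + 1

            # Walk over the tag name (word characters and "/").
            while j < len(text) and re.match(r"[/\w]", text[j]):
                j += 1

                # In reversed text, "/" is where a closing tag begins, so stop right after it.
                if backwards and text[j-1] == "/":
                    break

            # The companion bracket is missing: insert it at the stopping point.
            if j >= len(text) or text[j] != right_bracket:
                text = text[:j] + right_bracket + text[j:]

            i = j

        i += 1

    return text

def repair_tags(html):
    # Restore missing "<" on the reversed string first, then missing ">" going forward.
    return repair(repair(html[::-1], True)[::-1])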

Test

if __name__ == "__main__":
    original = '''<li>
    <a>
        About Us
        <span>
            Learn more about Stack Overflow the company
        </span>
    </a>
</li>
<li>
    <a>
        Business
        <span>
            Learn more about hiring developers or posting ads with us
        </span>
    </a>
</li>'''
    corrupted = '''li>
    <a
        About Us
        span>
            Learn more about Stack Overflow the company
        </span
    </a
/li>
<li
    <a
        Business
        span>
            Learn more about hiring developers or posting ads with us
        /span>
    </a
</li'''

    print(repair_tags(corrupted))
    print("repaired matches original?", repair_tags(corrupted) == original)

Output

<li>
    <a>
        About Us
        <span>
            Learn more about Stack Overflow the company
        </span>
    </a>
</li>
<li>
    <a>
        Business
        <span>
            Learn more about hiring developers or posting ads with us
        </span>
    </a>
</li>
repaired matches original? True

How it works

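Walk the string looking for a bracket character. When one is found, step forward until hitting the end of the string or a character that is neither a word character nor /. If the search reached the end of the string, or the character it stopped on isn't the matching companion bracket, insert the companion bracket there.

Then do the same on the reversed string, swapping the target brackets and adding one small tweak: stop as soon as the / of a closing tag has been consumed. (In repair_tags the reversed pass actually runs first, followed by the forward pass.)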

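The time complexity isn't great because of the string building: every insertion copies the whole string, so the worst case is quadratic in the input length. There's undoubtedly a simple regex that does this, so treat this as a proof of concept.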
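As a rough idea of what the regex route mentioned above could look like, here is a minimal sketch. It is not part of the original answer: the name `repair_tags_regex` is hypothetical, and it makes the same assumptions as the code above (tags have no attributes, and at most one bracket is missing per tag). It is also not a drop-in equivalent; for instance, it leaves a lone `<` or `>` with no adjacent tag name untouched.

```python
import re

def repair_tags_regex(html):
    """Hypothetical regex-based sketch; assumes attribute-free tags
    with at most one missing bracket each."""
    # Add a missing "<": a tag name (optionally starting with "/") that is
    # directly followed by ">" but not preceded by "<", "/" or a word character.
    html = re.sub(r"(?<![<\w/])(/?\w+)>", r"<\1>", html)
    # Add a missing ">": "<" or "</" plus a tag name that is not followed
    # by ">" (the \w in the lookahead stops partial matches of a longer name).
    html = re.sub(r"(</?\w+)(?![\w>])", r"\1>", html)
    return html

if __name__ == "__main__":
    text = ('Hello, how <sub> are </sub> you ? There is a <sub> small error </sub  '
            'in this text here and another one <sub> here /sub> .')
    print(repair_tags_regex(text))
```

Both substitutions run in a single pass over the string, which avoids the repeated string rebuilding. Like the loop-based version, it will misfire on prose where a word sits directly against a bracket, such as `x>3`.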
