Anonymize HTML with a regular expression

Zna*_*kus 3 html regex anonymize

I'm trying to anonymize an HTML string with a regular expression, for use in an SQL query.

https://regex101.com/r/QWt1E1/1

(?<!\<)[^<>\s](?!\>)

Input:
<p><em>Hi [User</em></p>
<p><em>Tack f&ouml;r visat intresse.</em></p>
<p><em>Good luck!</em><em>&nbsp;</em></p>
<p><em>Sincerely</em></p>

Current output:
<p><em>nn nnnnn</nm></p>
<p><em>nnnn nnnnnnnn nnnnn nnnnnnnnn</nm></p>
<p><em>nnnn nnnnn</nm><em>nnnnnn</nm></p>
<p><em>nnnnnnnnn</nm></p>

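The plan is to replace every character that is not inside <> with n. It almost works, but in my example it also replaces the e in </em>. I don't know why, or how to fix it.

How can I adjust the regex so that the e in the example is not replaced?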

Cer*_*nce 5

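The original pattern only inspects the characters immediately next to the current one, so the e in </em> (preceded by / and followed by m) passes both checks. Use a negative lookahead for [^<>]*> instead of just >. This ensures that the current position is not followed by a > before any other angle bracket, which would indicate that you are currently inside a tag.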

This also means that you can drop the lookbehind:

[^<>\s](?![^<>]*>)
          ^^^^^^

https://regex101.com/r/QWt1E1/3
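
To check the difference outside regex101, here is a minimal sketch in Python. The question does not say which regex engine is used, so Python's `re` module is an assumption, and the names `ORIGINAL`, `FIXED`, `anonymize`, and `sample` are made up for illustration. It runs both patterns over one line from the question's input.

```python
import re

# Pattern from the question: only looks at the directly adjacent characters.
ORIGINAL = re.compile(r"(?<!\<)[^<>\s](?!\>)")

# Pattern from this answer: skips any character that is followed by '>'
# before another angle bracket, i.e. any character inside a tag.
FIXED = re.compile(r"[^<>\s](?![^<>]*>)")


def anonymize(html: str, pattern: re.Pattern = FIXED) -> str:
    """Replace every character matched by `pattern` with 'n' (hypothetical helper)."""
    return pattern.sub("n", html)


sample = "<p><em>Good luck!</em><em>&nbsp;</em></p>"

print(anonymize(sample, ORIGINAL))  # <p><em>nnnn nnnnn</nm><em>nnnnnn</nm></p>
print(anonymize(sample, FIXED))     # <p><em>nnnn nnnnn</em><em>nnnnnn</em></p>
```

Note that the lookahead treats any text followed by a > as tag content, so a literal > in the text (for example `a > b`) would leave the characters before it unreplaced.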

Still, it would be better to parse the HTML with an HTML parser, if at all possible.
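
As a rough illustration of the parser route, the sketch below uses Python's standard-library `html.parser`. This is one possible approach under that assumption, not something the question prescribes. The class `Anonymizer` and the function `anonymize_html` are hypothetical names. Tags are copied through and only text and entity references are replaced with n. Comments and declarations are simply dropped in this version, and end tags are re-emitted in lowercase.

```python
import re
from html.parser import HTMLParser


class Anonymizer(HTMLParser):
    """Rebuilds the document, replacing non-whitespace text characters with 'n'."""

    def __init__(self):
        # Keep entity references such as &ouml; separate so they can be
        # replaced character for character, as in the question's output.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Copy the start tag exactly as it appeared in the source.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through once.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text between tags: replace everything except whitespace.
        self.parts.append(re.sub(r"\S", "n", data))

    def handle_entityref(self, name):
        # Named reference, e.g. &nbsp; becomes six n characters.
        self.parts.append("n" * len(f"&{name};"))

    def handle_charref(self, name):
        # Numeric reference, e.g. &#246;
        self.parts.append("n" * len(f"&#{name};"))


def anonymize_html(html: str) -> str:
    parser = Anonymizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(anonymize_html("<p><em>Tack f&ouml;r visat intresse.</em></p>"))
```

Because the parser decides what counts as a tag, this version does not rely on counting angle brackets in the text.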