Python regex: do not match a word when it is inside an HTML tag

Question


I need to write a regex that does not match a word if it is inside an HTML tag.

Here is a sample of the text:

asdd qwe <a href="http://example.com" title="Some title with word qwe" class="external-link" rel="nofollow">  qwe

My regex currently looks like this:

(?!(\<.+))[^a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ](<class="bad-word"(?: style="[^"]+")?>)?(qwe)(<>)?[^a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ](?!.+\>)

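It's a bit complicated, but everything works except for one thing: when I test it on regex101.com and regexr.com, it matches only the word after the HTML tag.

Any idea why?

Edit:

I don't want to use an HTML parser or DOM manipulation, because I don't want to change that much code.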

def test_tagged_word_present(self):
    input = 'words <a href="example.com" title="title with word qwe" class="external-link" rel="nofollow"> qwe some other words'
    expected = 'words <a href="example.com" title="title with word qwe" class="external-link" rel="nofollow"><strong class="bad-word" style="color:red">qwe</strong> some other words'
    parser = self.get_test_parser(input, search_word='qwe')
    text = parser.mark_words()
    self.assertEqual(text, expected)

Everything works, except that the regex still catches the qwe in the title attribute.

Answer 1

Jer*_*nes 5

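A good trick for excluding content inside HTML tags is to use a negative lookahead ("not followed by") that includes the angle bracket characters. For example, your regex ends with: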

(?!.+\>)

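This is presumably meant to say "not followed by one or more characters and a closing angle bracket".

However, "one or more characters" is too broad and matches more than you want. If you make it a bit stricter, it is no longer that greedy: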

(?![^<>]*>)

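So this says "not followed by any number of non-angle-bracket characters and then a closing angle bracket".

This way the replacement only happens when the word is outside an HTML tag. Inside a tag, the pattern in the lookahead would match, so the negative lookahead blocks the replacement.

You may also need to include <> in the other character classes to restrict them.

Note that this is not strictly 100% compliant, since attributes can legally contain these characters. In many cases, though, you know enough about your input to use [^<>] safely and simplify the task without causing any problems.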
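The difference between the two lookaheads can be checked directly on the sample text from the question. The sketch below is only an illustration; the names `broad`, `strict`, and `text` are made up for this example, and the search word is reduced to a bare `qwe` so that only the lookahead differs.

```python
import re

text = ('asdd qwe <a href="http://example.com" '
        'title="Some title with word qwe" class="external-link" '
        'rel="nofollow">  qwe')

# Broad version: rejects the word if ANY later ">" exists on the line,
# because ".+" can run across tag boundaries.
broad = re.compile(r'qwe(?!.+>)')

# Strict version: rejects the word only if a ">" follows before any
# other angle bracket, i.e. only if we are still inside a tag.
strict = re.compile(r'qwe(?![^<>]*>)')

broad_hits = [m.start() for m in broad.finditer(text)]
strict_hits = [m.start() for m in strict.finditer(text)]

print(broad_hits)   # only the occurrence after the tag
print(strict_hits)  # the occurrence before the tag and the one after it
```

With the broad lookahead, the first `qwe` is rejected because a `>` appears somewhere later in the line, which is consistent with the behaviour described in the question (only the word after the tag matches).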
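To make that caveat concrete, here is a small sketch of two inputs where the `[^<>]` shortcut gives the wrong answer. Both strings are invented for illustration and do not come from the question.

```python
import re

outside_tag = re.compile(r'qwe(?![^<>]*>)')

# A literal "<" inside an attribute value hides the closing ">" from the
# lookahead, so the word is matched even though it sits inside the tag.
inside_attribute = '<a title="qwe < other">link</a>'
false_positive = outside_tag.findall(inside_attribute)

# A literal ">" in plain text makes the lookahead think the word is
# inside a tag, so a word outside any tag is skipped.
plain_text = 'qwe > 5 <b>bold</b>'
false_negative = outside_tag.findall(plain_text)

print(false_positive)  # ['qwe']
print(false_negative)  # []
```

If the input is known to escape these characters as `&lt;` and `&gt;`, neither case should arise.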

$ python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> mystring = 'asdd qwe <a href="http://example.com" title="Some title with word qwe" class="external-link" rel="nofollow">  qwe '
>>> import re
>>> p=re.compile(r'([^\s<>]+)(?![^<>]*>)')
>>> p.findall(mystring)
['asdd', 'qwe', 'qwe']
>>>
$

Second test:

$ python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> mystring = r'words <a href="example.com" title="title with word qwe" class="external-link" rel="nofollow"> qwe some other words'
>>> p=re.compile(r'([^\s<>]+)(?![^<>]*>)')
>>> p.findall(mystring)
['words', 'qwe', 'some', 'other', 'words']
>>> mystring = r'words <a href="example.com" title="title with word qwe" class="external-link" rel="nofollow"> qwe <strong class="bad-word" style="color:red">podmiotu</strong> some other words'
>>> p.findall(mystring)
['words', 'qwe', 'podmiotu', 'some', 'other', 'words']
>>>

Note that 'qwe' appears in both strings outside the HTML tag, so I think it should match.

To search for a specific word, just use it in the regex:

Find the word "some" if it is outside HTML:

>>> p=re.compile(r'(some)(?![^<>]*>)')
>>> p.findall(mystring)
['some']
>>>

Find the word "external" if it is outside HTML (no match, which is correct):

>>> p=re.compile(r'(external)(?![^<>]*>)')
>>> p.findall(mystring)
[]
>>>

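The same lookahead can be combined with `re.sub` to do the kind of marking the question's test describes. This is a minimal sketch, not the asker's actual `mark_words` implementation: the function name `mark_word` is hypothetical, `\b` is assumed to be an acceptable word boundary, and the whitespace around the word is left untouched, so the output differs slightly from the `expected` string in the question.

```python
import re

def mark_word(text, word):
    """Wrap `word` in a <strong> element wherever it occurs outside an HTML tag."""
    # re.escape guards against regex metacharacters in the search word.
    # The negative lookahead skips occurrences that are followed by ">"
    # with no other angle bracket in between, i.e. occurrences inside a tag.
    pattern = re.compile(r'\b(' + re.escape(word) + r')\b(?![^<>]*>)')
    return pattern.sub(
        r'<strong class="bad-word" style="color:red">\1</strong>', text)

html = ('words <a href="example.com" title="title with word qwe" '
        'class="external-link" rel="nofollow"> qwe some other words')
marked = mark_word(html, 'qwe')
print(marked)
```

The `qwe` in the `title` attribute is left alone, and only the one after the tag is wrapped.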
