Python regex match fails with UTF-8 characters

Kje*_*din 6 python regex utf-8 special-characters

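I have a selenium/python project that uses regex matching to find HTML elements. The element attributes sometimes include the Danish/Norwegian characters ÆØÅ. The problem is in the following snippet: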

if (re.match(regexp_expression, compare_string)):
    result = True
else :
    result = False

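Both regexp_expression and compare_string are manipulated before the regex match is performed. If I print them right before the snippet above runs, and also print the result, I get the following output: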

Regex_expression: [^log på$]
compare string: [log på]
result = false

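I added the square brackets to make sure there is no stray whitespace. They are only part of the print statements, not part of the string variables.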

However, if I try to reproduce the problem in a separate script, like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

regexp_expression  = "^log på$"
compare_string = "log på"

if (re.match(regexp_expression, compare_string)):
    print("result true")
    result = True
else :
    print("result = false")
    result = False

then the result is true.

How can this be? To make it even stranger, it worked earlier, and I'm not sure what I edited that broke it...

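The full module containing the regex comparison method is shown below. I didn't write it myself, so I'm not 100% familiar with the reasoning behind all the replace statements and string manipulation, but I would think it shouldn't matter, since I can inspect both strings right before the failing match at the bottom...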

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

def regexp_compare(regexp_expression, compare_string):
    #final int DOTALL
    #try:    // include try catch for "PatternSyntaxException" while testing/including a new symbol in this method..

    #catch(PatternSyntaxException e):
    #    System.out.println("Regexp>>"+regexp_expression)
    #    e.printStackTrace()
    #*/


    if(not compare_string.strip() and (not regexp_expression.strip() or regexp_expression.strip().lower() == "*".lower()) or (regexp_expression.strip().lower() == ".*".lower())):
        print("return 1")
        return True                

    if(not compare_string or not regexp_expression):
        print("return 2")
        return False                

    regexp_expression = regexp_expression.lower()
    compare_string = compare_string.lower()

    if(not regexp_expression.strip()): 
        regexp_expression = ""

    if(not compare_string.strip() and (not regexp_expression.strip() or regexp_expression.strip().lower() == "*".lower()) or (regexp_expression.strip().lower() == ".*".lower())):
        regexp_expression = ""
    else:

        regexp_expression = regexp_expression.replace("\\","\\\\")
        regexp_expression = regexp_expression.replace("\\.","\\\\.")
        regexp_expression = regexp_expression.replace("\\*", ".*")
        regexp_expression = regexp_expression.replace("\\(", "\\\\(")
        regexp_expression = regexp_expression.replace("\\)", "\\\\)")           
        regexp_expression_arr = regexp_expression.split("|")
        regexp_expression = ""

        for i in range(0, len(regexp_expression_arr)):
            if(not(regexp_expression_arr[i].startswith("^"))):
                regexp_expression_arr[i] = "^"+regexp_expression_arr[i]

            if(not(regexp_expression_arr[i].endswith("$"))):
                regexp_expression_arr[i] = regexp_expression_arr[i]+"$"

            regexp_expression = regexp_expression_arr[i] if regexp_expression == "" else regexp_expression+"|"+regexp_expression_arr[i]  




    result = None        

    print("Regex_expression: [" + regexp_expression+"]")
    print("compare string: [" + compare_string+"]")

    if (re.match(regexp_expression, compare_string)):
        print("result true")
        result = True
    else :
        print("result = false")
        result = False

    print("return result")
    return result

Cod*_*key 3

You are probably comparing a unicode string with a non-unicode string.


For example, in the following:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

regexp_expression  = "^log på$"
compare_string = u"log på"

if (re.match(regexp_expression, compare_string)):
    print("result true")
    result = True
else :
    print("result = false")
    result = False

You will get the output False. So there is probably a point in your manipulation where one of the strings is not unicode.
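One way to locate that point is to print the type and repr() of both values instead of the values themselves, because a plain print can render a byte string and a unicode string identically. The sketch below is an illustration only, not part of the original module: describe is a hypothetical helper name, and it is written for Python 3, where the two types are str and bytes (in Python 2 the corresponding pair is unicode and str).

```python
# -*- coding: utf-8 -*-

def describe(value):
    """Return the type name, length, and repr of a string-like value."""
    return "{} len={} {!r}".format(type(value).__name__, len(value), value)

text_value = "log på"                   # text: 6 code points
byte_value = "log på".encode("utf-8")   # UTF-8 bytes: å occupies two bytes

# Both values look the same when decoded and printed, but their type,
# length, and repr differ, which is what makes the regex comparison fail.
print(describe(text_value))
print(describe(byte_value))
```

Calling describe on regexp_expression and compare_string right before re.match should show whether the two arguments are of the same type.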


The same failure also occurs when the types are the other way round:

regexp_expression  = u"^log på$"
compare_string = "log på"
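A possible way to rule out such a mismatch is to convert both arguments to text before matching. This is a minimal sketch under assumptions: to_text and matches are hypothetical helper names, and the byte strings are assumed to be UTF-8 encoded, which may not hold for every value in your project.

```python
# -*- coding: utf-8 -*-
import re

def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; return text values unchanged.

    The UTF-8 default is an assumption about how the bytes were encoded.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def matches(pattern, candidate):
    """Run re.match only after both arguments are the same (text) type."""
    return re.match(to_text(pattern), to_text(candidate)) is not None

# Mixed inputs in either direction compare equal once both are text.
print(matches(b"^log p\xc3\xa5$", "log på"))
print(matches("^log på$", b"log p\xc3\xa5"))
```

Converting once at the top of regexp_compare, before the lower() and replace() calls, would keep every later step working on a single string type.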