Kje*_*din 6 python regex utf-8 special-characters
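I have a selenium/python project that uses regular expression matching to find HTML elements. The element attributes sometimes contain the Danish/Norwegian characters ÆØÅ. The problem is in the following snippet: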
    if (re.match(regexp_expression, compare_string)):
        result = True
    else :
        result = False
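Both regexp_expression and compare_string are manipulated before the regex match is performed. If I print them right before the snippet above runs, and then print the result, I get the following output: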
    Regex_expression: [^log på$]
    compare string: [log på]
    result = false
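I added the brackets to make sure there is no stray whitespace. They are only part of the print statements, not part of the string variables.
However, if I try to reproduce the problem in a separate script, like this: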
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    regexp_expression = "^log på$"
    compare_string = "log på"

    if (re.match(regexp_expression, compare_string)):
        print("result true")
        result = True
    else :
        print("result = false")
        result = False
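then the result is true.
How can that be? To make it even stranger, it worked earlier, and I'm not sure what I edited that made it break...
The complete module containing the regex comparison method is shown below. I didn't write it myself, so I'm not 100% familiar with the reasons for all the replace statements and string manipulation, but I think that shouldn't matter, since I can inspect the strings right before the failing match at the bottom...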
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    def regexp_compare(regexp_expression, compare_string):
        #final int DOTALL
        #try: // include try catch for "PatternSyntaxException" while testing/including a new symbol in this method..
        #catch(PatternSyntaxException e):
        #    System.out.println("Regexp>>"+regexp_expression)
        #    e.printStackTrace()
        #*/
        if(not compare_string.strip() and (not regexp_expression.strip() or regexp_expression.strip().lower() == "*".lower()) or (regexp_expression.strip().lower() == ".*".lower())):
            print("return 1")
            return True

        if(not compare_string or not regexp_expression):
            print("return 2")
            return False

        regexp_expression = regexp_expression.lower()
        compare_string = compare_string.lower()

        if(not regexp_expression.strip()):
            regexp_expression = ""

        if(not compare_string.strip() and (not regexp_expression.strip() or regexp_expression.strip().lower() == "*".lower()) or (regexp_expression.strip().lower() == ".*".lower())):
            regexp_expression = ""
        else:
            regexp_expression = regexp_expression.replace("\\","\\\\")
            regexp_expression = regexp_expression.replace("\\.","\\\\.")
            regexp_expression = regexp_expression.replace("\\*", ".*")
            regexp_expression = regexp_expression.replace("\\(", "\\\\(")
            regexp_expression = regexp_expression.replace("\\)", "\\\\)")
            regexp_expression_arr = regexp_expression.split("|")
            regexp_expression = ""
            for i in range(0, len(regexp_expression_arr)):
                if(not(regexp_expression_arr[i].startswith("^"))):
                    regexp_expression_arr[i] = "^"+regexp_expression_arr[i]
                if(not(regexp_expression_arr[i].endswith("$"))):
                    regexp_expression_arr[i] = regexp_expression_arr[i]+"$"
                regexp_expression = regexp_expression_arr[i] if regexp_expression == "" else regexp_expression+"|"+regexp_expression_arr[i]

        result = None
        print("Regex_expression: [" + regexp_expression+"]")
        print("compare string: [" + compare_string+"]")
        if (re.match(regexp_expression, compare_string)):
            print("result true")
            result = True
        else :
            print("result = false")
            result = False
        print("return result")
        return result
You are probably comparing a unicode string with a non-unicode string.

For example, in the following:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import re

    regexp_expression = "^log på$"
    compare_string = u"log på"

    if (re.match(regexp_expression, compare_string)):
        print("result true")
        result = True
    else :
        print("result = false")
        result = False

you will get the output False (under Python 2, where an unprefixed literal is a byte string and only the u"..." literal is unicode). So somewhere in your string manipulation there is probably a point where one of the two strings is not unicode.

The same mismatch also gives the same result in the following:

    regexp_expression = u"^log på$"
    compare_string = "log på"
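One way to guard against this kind of mismatch is to decode any byte string to text before matching, so that both arguments to re.match are the same type. The following is only a minimal sketch: the helper names to_text and text_match and the UTF-8 default are assumptions for illustration, and nothing here is taken from the module in the question.

```python
# -*- coding: utf-8 -*-
import re


def to_text(value, encoding="utf-8"):
    """Return value as a text (unicode) string, decoding it if it is bytes.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def text_match(pattern, candidate):
    """Run re.match only after both arguments have been converted to text."""
    return re.match(to_text(pattern), to_text(candidate)) is not None


# A text pattern and a UTF-8 encoded byte string, i.e. mismatched inputs.
pattern = u"^log p\u00e5$"
candidate = u"log p\u00e5".encode("utf-8")

# repr() shows the type and the raw escapes that a plain print() hides.
print(repr(pattern))
print(repr(candidate))
print(text_match(pattern, candidate))
```

Printing repr() of both strings just before the failing re.match call in regexp_compare should show whether one of them is a byte string while the other is unicode.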