ade*_*hus 10 c# regex performance
下面的代码包含一个用于提取C#字符串文字的正则表达式,但是对于多个字符的输入字符串,正则表达式匹配的性能很糟糕.
class Program
{
private static void StringMatch(string s)
{
// regex: quote, zero-or-more-(zero-or-more-non-backslash-quote, optional-backslash-anychar), quote
Match m = Regex.Match(s, "\"(([^\\\\\"]*)(\\\\.)?)*\"");
if (m.Success)
Trace.WriteLine(m.Value);
else
Trace.WriteLine("no match");
}
public static void Main()
{
// this first string is unterminated (so the match fails), but it returns instantly
StringMatch("\"OK");
// this string is terminated (the match succeeds)
StringMatch("\"This is a longer terminated string - it matches and returns instantly\"");
// this string is unterminated (so the match will fail), but it never returns
StringMatch("\"This is another unterminated string and takes FOREVER to match");
}
}
Run Code Online (Sandbox Code Playgroud)
我可以将正则表达式重构为另一种形式,但任何人都可以解释为什么表现如此糟糕?
Tim*_*ker 13
你遇到了灾难性的回溯:
让我们简化一下正则表达式(没有转义引号,也没有第二个可选组,因为在评论中,它与测试的字符串无关):
"(([^\\"]*))*"
Run Code Online (Sandbox Code Playgroud)
([^\\"]*)匹配除引号或反斜杠之外的任何字符串.这再次包含在可重复任意次数的可选组中.
现在对于字符串"ABC,正则表达式引擎需要尝试以下排列:
", ABC ",ABC,<empty string> ",AB,C ",AB,C,<empty string> ",AB,<empty string>,C ",AB,<empty string>,C,<empty string> ",<empty string>,AB,C ",<empty string>,AB,C,<empty string> ",<empty string>,AB,<empty string>,C,<empty string> ",<empty string>,AB,<empty string>,C ",A,BC ",A,BC,<empty string> ",A,<empty string>,BC ",<empty string>,A,BC ",A,B,C ",A,B,C,<empty string> ",A,B,<empty string>,C 然后每个都失败,因为没有以下".
此外,您只测试子字符串而不是强制正则表达式匹配整个字符串.并且您通常希望使用逐字符串来表示正则表达式,以减少所需的反斜杠数量.这个怎么样:
foundMatch = Regex.IsMatch(subjectString,
@"\A # Start of the string
"" # Match a quote
(?: # Either match...
\\. # an escaped character
| # or
[^\\""] # any character except backslash or quote
)* # any number of times
"" # Match a quote
\Z # End of the string",
RegexOptions.IgnorePatternWhitespace);
Run Code Online (Sandbox Code Playgroud)