慢速正则表达性能

ade*_*hus 10 c# regex performance

下面的代码包含一个用于提取C#字符串文字的正则表达式,但是对于多个字符的输入字符串,正则表达式匹配的性能很糟糕.

class Program
{
   private static void StringMatch(string s)
    {
        // regex: quote, zero-or-more-(zero-or-more-non-backslash-quote, optional-backslash-anychar), quote
        Match m = Regex.Match(s, "\"(([^\\\\\"]*)(\\\\.)?)*\"");
        if (m.Success)
            Trace.WriteLine(m.Value);
        else
            Trace.WriteLine("no match");
    }

    public static void Main()
    {
        // this first string is unterminated (so the match fails), but it returns instantly
        StringMatch("\"OK");

        // this string is terminated (the match succeeds)
        StringMatch("\"This is a longer terminated string - it matches and returns instantly\"");

        // this string is unterminated (so the match will fail), but it never returns
        StringMatch("\"This is another unterminated string and takes FOREVER to match");
    }
}
Run Code Online (Sandbox Code Playgroud)

我可以将正则表达式重构为另一种形式,但任何人都可以解释为什么表现如此糟糕?

Tim*_*ker 13

你遇到了灾难性的回溯:

让我们简化一下正则表达式(没有转义引号,也没有第二个可选组,因为在评论中,它与测试的字符串无关):

"(([^\\"]*))*" 
Run Code Online (Sandbox Code Playgroud)

([^\\"]*)匹配除引号或反斜杠之外的任何字符串.这再次包含在可重复任意次数的可选组中.

现在对于字符串"ABC,正则表达式引擎需要尝试以下排列:

  • ", ABC
  • ",ABC,<empty string>
  • ",AB,C
  • ",AB,C,<empty string>
  • ",AB,<empty string>,C
  • ",AB,<empty string>,C,<empty string>
  • ",<empty string>,AB,C
  • ",<empty string>,AB,C,<empty string>
  • ",<empty string>,AB,<empty string>,C,<empty string>
  • ",<empty string>,AB,<empty string>,C
  • ",A,BC
  • ",A,BC,<empty string>
  • ",A,<empty string>,BC
  • ",<empty string>,A,BC
  • 等等
  • ",A,B,C
  • ",A,B,C,<empty string>
  • ",A,B,<empty string>,C
  • 等等

然后每个都失败,因为没有以下".

此外,您只测试子字符串而不是强制正则表达式匹配整个字符串.并且您通常希望使用逐字符串来表示正则表达式,以减少所需的反斜杠数量.这个怎么样:

foundMatch = Regex.IsMatch(subjectString, 
    @"\A     # Start of the string
    ""       # Match a quote
    (?:      # Either match...
     \\.     # an escaped character
    |        # or
     [^\\""] # any character except backslash or quote
    )*       # any number of times
    ""       # Match a quote
    \Z       # End of the string", 
    RegexOptions.IgnorePatternWhitespace);
Run Code Online (Sandbox Code Playgroud)