Why is a compiled RegEx slower than an interpreted RegEx?

dr.*_*vil 23 .net regex performance

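I came across this article:

Performance: Compiled vs. Interpreted Regular Expressions. I modified the sample code to compile 1000 regexes and then run each one 500 times to take advantage of the precompilation, but even in that case the interpreted regexes run 4 times faster!

Does this mean the RegexOptions.Compiled option is completely useless, or even worse, that it is slower? Most of the difference turned out to be due to JIT. After accounting for JIT compilation of the compiled regexes, the code below still runs a bit slower, which makes no sense to me, but @Jim provided a much cleaner version in his answer that works as expected.

Can anyone explain why this is the case?

Code, taken and modified from the blog post: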

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace RegExTester
{
    class Program
    {
        static void Main(string[] args)
        {
            DateTime startTime = DateTime.Now;

            for (int i = 0; i < 1000; i++)
            {
                CheckForMatches("some random text with email address, address@domain200.com" + i.ToString());    
            }


            double msTaken = DateTime.Now.Subtract(startTime).TotalMilliseconds;
            Console.WriteLine("Full Run: " + msTaken);


            startTime = DateTime.Now;

            for (int i = 0; i < 1000; i++)
            {
                CheckForMatches("some random text with email address, address@domain200.com" + i.ToString());
            }


            msTaken = DateTime.Now.Subtract(startTime).TotalMilliseconds;
            Console.WriteLine("Full Run: " + msTaken);

            Console.ReadLine();

        }


        private static List<Regex> _expressions;
        private static object _SyncRoot = new object();

        private static List<Regex> GetExpressions()
        {
            if (_expressions != null)
                return _expressions;

            lock (_SyncRoot)
            {
                if (_expressions == null)
                {
                    DateTime startTime = DateTime.Now;

                    List<Regex> tempExpressions = new List<Regex>();
                    string regExPattern =
                        @"^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@{0}$";

                    for (int i = 0; i < 2000; i++)
                    {
                        tempExpressions.Add(new Regex(
                            string.Format(regExPattern,
                            Regex.Escape("domain" + i.ToString() + "." +
                            (i % 3 == 0 ? ".com" : ".net"))),
                            RegexOptions.IgnoreCase));//  | RegexOptions.Compiled
                    }

                    _expressions = new List<Regex>(tempExpressions);
                    DateTime endTime = DateTime.Now;
                    double msTaken = endTime.Subtract(startTime).TotalMilliseconds;
                    Console.WriteLine("Init:" + msTaken);
                }
            }

            return _expressions;
        }

        static  List<Regex> expressions = GetExpressions();

        private static void CheckForMatches(string text)
        {

            DateTime startTime = DateTime.Now;


                foreach (Regex e in expressions)
                {
                    bool isMatch = e.IsMatch(text);
                }


            DateTime endTime = DateTime.Now;
            //double msTaken = endTime.Subtract(startTime).TotalMilliseconds;
            //Console.WriteLine("Run: " + msTaken);

        }
    }
}

Jim*_*hel 38

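Compiled regular expressions match faster when used as intended. As others have pointed out, the idea is to compile them once and use them many times. The construction and initialization time is amortized over those runs.

I created a much simpler test that will show you that compiled regular expressions are unquestionably faster than uncompiled ones.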

    const int NumIterations = 1000;
    const string TestString = "some random text with email address, address@domain200.com";
    const string Pattern = "^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@domain0\\.\\.com$";
    private static Regex NormalRegex = new Regex(Pattern, RegexOptions.IgnoreCase);
    private static Regex CompiledRegex = new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
    private static Regex DummyRegex = new Regex("^.$");

    static void Main(string[] args)
    {
        var DoTest = new Action<string, Regex, int>((s, r, count) =>
            {
                Console.Write("Testing {0} ... ", s);
                Stopwatch sw = Stopwatch.StartNew();
                for (int i = 0; i < count; ++i)
                {
                    bool isMatch = r.IsMatch(TestString + i.ToString());
                }
                sw.Stop();
                Console.WriteLine("{0:N0} ms", sw.ElapsedMilliseconds);
            });

        // Make sure that DoTest is JITed
        DoTest("Dummy", DummyRegex, 1);
        DoTest("Normal first time", NormalRegex, 1);
        DoTest("Normal Regex", NormalRegex, NumIterations);
        DoTest("Compiled first time", CompiledRegex, 1);
        DoTest("Compiled", CompiledRegex, NumIterations);

        Console.WriteLine();
        Console.Write("Done. Press Enter:");
        Console.ReadLine();
    }

Setting NumIterations to 500 gives me this:

Testing Dummy ... 0 ms
Testing Normal first time ... 0 ms
Testing Normal Regex ... 1 ms
Testing Compiled first time ... 13 ms
Testing Compiled ... 1 ms

With 5 million iterations, I get:

Testing Dummy ... 0 ms
Testing Normal first time ... 0 ms
Testing Normal Regex ... 17,232 ms
Testing Compiled first time ... 17 ms
Testing Compiled ... 15,299 ms

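Here you can see that the compiled regular expression is at least 10% faster than the uncompiled version.

Interestingly, if RegexOptions.IgnoreCase is removed from the regular expressions, the results for 5 million iterations are even more striking: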

Testing Dummy ... 0 ms
Testing Normal first time ... 0 ms
Testing Normal Regex ... 12,869 ms
Testing Compiled first time ... 14 ms
Testing Compiled ... 8,332 ms

Here, the compiled regular expression is 35% faster than the uncompiled one.

In my opinion, the blog post you reference is simply a flawed test.
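As a rough back-of-the-envelope check on the figures above, the sketch below estimates how many matches it takes before the per-match saving repays the one-time compilation cost. It assumes the "Compiled first time" figure is a fair stand-in for that one-time cost; since it excludes the time spent in the Regex constructor, the real break-even point is probably somewhat higher. `BreakEvenIterations` is a hypothetical helper written for this illustration, not a library call.

```csharp
using System;

// Hypothetical helper: number of matches after which a one-time compile cost
// is repaid by the time saved on each individual match.
static double BreakEvenIterations(double oneTimeCostMs,
                                  double interpretedTotalMs,
                                  double compiledTotalMs,
                                  long runs)
{
    // Average saving per match, derived from the totals of a long run.
    double savingPerMatchMs = (interpretedTotalMs - compiledTotalMs) / runs;
    return oneTimeCostMs / savingPerMatchMs;
}

// Figures copied from the 5-million-iteration runs above.
// Assumption: "Compiled first time" approximates the one-time cost.
double withIgnoreCase = BreakEvenIterations(17, 17232, 15299, 5_000_000);
double withoutIgnoreCase = BreakEvenIterations(14, 12869, 8332, 5_000_000);

Console.WriteLine("Break-even with IgnoreCase:    {0:N0} matches", withIgnoreCase);
Console.WriteLine("Break-even without IgnoreCase: {0:N0} matches", withoutIgnoreCase);
```

With these inputs the estimate comes out at roughly 44,000 matches with IgnoreCase and roughly 15,000 without, which is consistent with the 500-iteration run showing no benefit from compilation.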

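  • @IDisposable: Actually, the comma is a thousands separator. That's 17 seconds and change versus 15 seconds and change. The reported numbers are the total time for 5 million iterations, not the average time per iteration. (2 upvotes)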

Muh*_*han 6

http://www.codinghorror.com/blog/2005/03/to-compile-or-not-to-compile.html

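Compiling only helps if the regex is instantiated once and reused many times. If you create a compiled regex inside a for loop, it will obviously perform worse. Can you show us your sample code?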

  • However, your code instantiates (and therefore compiles) the regex inside the loop, so you are actually compiling it 500 times. (2 upvotes)
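The difference between the two usage patterns can be sketched as follows. This is a minimal illustration rather than a rigorous benchmark: the pattern, the input string, and the `Runs` count are assumptions chosen for the example, and the timings will vary by machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

const string Pattern = @"^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@domain200\.com$";
const string Input = "address@domain200.com";
const int Runs = 200; // assumed count, kept small so the sketch finishes quickly

// Anti-pattern: a new compiled Regex is constructed on every iteration,
// so the compilation work is repeated each time.
var sw = Stopwatch.StartNew();
int hitsInLoop = 0;
for (int i = 0; i < Runs; i++)
{
    var perIteration = new Regex(Pattern, RegexOptions.Compiled);
    if (perIteration.IsMatch(Input)) hitsInLoop++;
}
sw.Stop();
Console.WriteLine("Constructed in loop: {0} ms", sw.ElapsedMilliseconds);

// Intended use: construct once, then reuse the same instance for every match.
sw.Restart();
var reused = new Regex(Pattern, RegexOptions.Compiled);
int hitsReused = 0;
for (int i = 0; i < Runs; i++)
{
    if (reused.IsMatch(Input)) hitsReused++;
}
sw.Stop();
Console.WriteLine("Constructed once:    {0} ms", sw.ElapsedMilliseconds);
```

Both loops perform the same matches; only the placement of the constructor differs.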

Gab*_*abe 5

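The problem with this benchmark is that compiled Regexes carry the overhead of creating a whole new assembly and loading it into the AppDomain.

The scenario compiled Regexes were designed for (I believe; I didn't design them) is hundreds of Regexes executed millions of times, not thousands of Regexes executed thousands of times. If you aren't going to execute a Regex somewhere on the order of a million times, you probably won't even make up the time it takes to JIT-compile it.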
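One way to see that up-front overhead in isolation is to time only the construction of many distinct regexes, with and without RegexOptions.Compiled. The sketch below does that; `BuildRegexes` is a hypothetical helper, the count of 200 is an arbitrary assumption, and the patterns merely imitate the ones in the question.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper: construct `count` distinct regexes with the given options
// and report how many were built and how long construction took.
static (int built, long elapsedMs) BuildRegexes(int count, RegexOptions options)
{
    var sw = Stopwatch.StartNew();
    var regexes = new Regex[count];
    for (int i = 0; i < count; i++)
    {
        // Each pattern differs by domain, so no two regexes are identical.
        string domain = Regex.Escape("domain" + i + ".com");
        regexes[i] = new Regex("^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@" + domain + "$", options);
    }
    sw.Stop();
    return (regexes.Length, sw.ElapsedMilliseconds);
}

var interpreted = BuildRegexes(200, RegexOptions.IgnoreCase);
var compiled = BuildRegexes(200, RegexOptions.IgnoreCase | RegexOptions.Compiled);

Console.WriteLine("Interpreted init: {0} ms for {1} regexes", interpreted.elapsedMs, interpreted.built);
Console.WriteLine("Compiled init:    {0} ms for {1} regexes", compiled.elapsedMs, compiled.built);
```

Comparing the two init times against the per-match saving gives a feel for how many executions each regex needs before compilation pays off.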