Regex speed: is Python 6x faster than C++11 under VS2013?

Question

Gab*_*iMe 11 python regex performance c++11 visual-studio-2013

Is it possible that Python's C regex implementation is 6 times faster, or am I missing something?

Python version:

import re
r=re.compile(r'(HELLO).+?(\d+)', re.I)
s=r"prefixdfadfadf adf adf adf adf he asdf dHello Regex 123"

%timeit r.search(s)

1000000 loops, best of 3: 1.3 µs per loop (769,000 per sec)

C++11 version:

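#include <chrono>
#include <regex>
#include <string>
// bench_utils::run is the helper from the edit further down in this question.

int main(int argc, char * argv[])
{
    std::string s = "prefixdfadfadf adf adf adf adf he asdf dHello Regex 123";
    // The regex is constructed once, outside the timed lambda.
    std::regex my(R"((HELLO).+?(\d+))", std::regex_constants::icase);

    bench_utils::run(std::chrono::seconds(10),
        [&]{
            std::smatch match;
            bool found = std::regex_search(s, match, my);
        });
    return 0;
}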

Results in ~125,000 searches/second

Edit: here is the code for bench_utils:

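#include <chrono>
#include <functional>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>

namespace bench_utils
{
    // Format a value using the user's default locale (e.g. with digit grouping).
    template<typename T>
    inline std::string formatNum(const T& value)
    {
        static std::locale loc("");
        std::stringstream ss;
        ss.imbue(loc);
        ss << value;
        return ss.str();
    }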

    inline void run(const std::chrono::milliseconds &duration,
        const std::function<void() >& fn)
    {
        using namespace std::chrono;
        typedef steady_clock the_clock;
        size_t counter = 0;
        seconds printInterval(1);
        auto startTime = the_clock::now();
        auto lastPrintTime = startTime;
        while (true)
        {
            fn();
            counter++;
            auto now = the_clock::now();
            if (now - startTime >= duration)
                break;
            auto p = now - lastPrintTime;
            if (now - lastPrintTime >= printInterval)
            {
                std::cout << formatNum<size_t>(counter) << " ops per second" << std::endl;
                counter = 0;
                lastPrintTime = the_clock::now();
            }
        }
    }

}

Answer 1

Bla*_*lsh 7

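The first thing to note is that in Python, regex matching (whether you use the re module or the regex module) happens "at the speed of C". The actual heavy lifting is done by cold hard C code, so at least for longer strings the performance depends on the C regexp implementation.

Python itself is also quite capable: it has no trouble performing tens of millions of operations per second and can create millions of objects per second. That is a thousand times slower than C, but when the operation takes microseconds to begin with, Python's overhead may not really matter, since it only adds about 0.1 microseconds to each function call.

So in this case Python's relative slowness doesn't matter. It is fast enough in absolute terms that what counts is how quickly the regular expression functions themselves do their work.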
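To get a rough feel for that overhead on your own machine, the sketch below times an empty function call next to the compiled search from the question. The helper name `per_call_us` and the iteration count are made up for illustration, and the numbers will vary with interpreter version and hardware. The search is wrapped in a lambda, so its figure includes one extra Python call.

```python
import re
import timeit

# Pattern and subject string copied from the question.
PATTERN = re.compile(r'(HELLO).+?(\d+)', re.I)
SUBJECT = r"prefixdfadfadf adf adf adf adf he asdf dHello Regex 123"


def noop():
    """Empty function: approximates the bare cost of one Python call."""
    return None


def per_call_us(fn, number=200_000):
    """Hypothetical helper: average microseconds per call of fn."""
    total_seconds = timeit.timeit(fn, number=number)
    return total_seconds / number * 1e6


if __name__ == "__main__":
    call_cost = per_call_us(noop)
    # The lambda adds one extra Python-level call on top of the search itself.
    search_cost = per_call_us(lambda: PATTERN.search(SUBJECT))
    print(f"empty call:   {call_cost:.3f} us")
    print(f"regex search: {search_cost:.3f} us")
```

If the empty call is a small fraction of the search time, the interpreter's overhead is not what dominates the measurement.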

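I rewrote the C++ case so that it is not open to any of those criticisms (I hope; feel free to point out any problems). In fact it doesn't even need to create a match object, because this overload of regex_search simply returns a bool (true/false):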

#include <iostream>
#include <regex>
#include <string>

int main(int argc, char * argv[])
{
    std::string s = "prefixdfadfadf adf adf adf adf he asdf dHello Regex 123";
    std::regex my(R"((HELLO).+?(\d+))", std::regex_constants::icase);

    int matches = 0;
    for (int i = 0; i < 1000000; ++i)
        matches += std::regex_search(s, my);


    std::cout << matches  << std::endl;
    return 0;
}

I wrote a similar Python program (although Python does create and return a match object), and my results agree with yours:

C++: 6.661 s
Python: 1.039 s
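The Python program itself is not shown in this answer. A minimal sketch of what such a loop might look like, assuming it mirrors the C++ version above (the function name `count_matches` is an assumption, not the author's code):

```python
import re
import time

# Pattern and subject string copied from the question.
PATTERN = re.compile(r'(HELLO).+?(\d+)', re.I)
SUBJECT = r"prefixdfadfadf adf adf adf adf he asdf dHello Regex 123"


def count_matches(iterations):
    """Run the compiled search `iterations` times and count the hits."""
    matches = 0
    for _ in range(iterations):
        # search() returns a match object on success and None otherwise.
        if PATTERN.search(SUBJECT) is not None:
            matches += 1
    return matches


if __name__ == "__main__":
    start = time.perf_counter()
    total = count_matches(1_000_000)  # same iteration count as the C++ loop
    elapsed = time.perf_counter() - start
    print(total)
    print(f"{elapsed:.3f} s")
```

Unlike the C++ loop, every iteration here allocates a match object, so the comparison is, if anything, tilted against Python.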

I think the basic conclusion here is that Python's regex implementation simply thrashes the one in the C++ standard library.

It also beats Go

A while ago, just for fun, I compared Python's regex performance with Go's, and Python was at least twice as fast.

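The conclusion is that Python's regexp implementation is very good, and you certainly should not look outside Python to get better regexp performance. The work a regular expression does is fundamentally time-consuming enough that Python's overhead doesn't matter in the slightest, and Python has a great implementation (the newer regex module is often even faster than re).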

Answer 2

Goo*_*son 1

Benchmarking with timeit this way is wrong, because it reports the best of 3 runs rather than a statistical test of the difference.
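The figure in the question is indeed a best-of-3 number. One way to look at the spread instead is sketched below with the standard `timeit.repeat` and `statistics` modules. The helper name `summarize` and the repeat counts are arbitrary choices for illustration, and this reports descriptive statistics only, not a significance test.

```python
import re
import statistics
import timeit

# Pattern and subject string copied from the question.
PATTERN = re.compile(r'(HELLO).+?(\d+)', re.I)
SUBJECT = r"prefixdfadfadf adf adf adf adf he asdf dHello Regex 123"


def summarize(repeat=7, number=100_000):
    """Hypothetical helper: per-call timings in microseconds over several repeats."""
    # timeit.repeat returns one total (in seconds) per repeat.
    totals = timeit.repeat(lambda: PATTERN.search(SUBJECT),
                           repeat=repeat, number=number)
    per_call = [t / number * 1e6 for t in totals]
    return {
        "min": min(per_call),                    # what "best of N" reports
        "mean": statistics.mean(per_call),
        "stdev": statistics.stdev(per_call),     # needs repeat >= 2
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f} us")
```

If the standard deviation is small compared with the gap between the two languages, the best-of-3 figure is unlikely to be what explains the difference.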

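It's your code, not the language.

1. Passing the function as a std::function makes the C++ code slower;
2. Calling the clock functions in every iteration;
3. Creating new objects, such as std::smatch match;, in every iteration;
4. Running the function;
5. Not precompiling the regex.

I also wonder what optimization settings you are building with.

The run() function does too much. Fix that. :)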

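1. is an invalid argument. Python has more overhead for every operation, so *if* it were a significant argument, it would be a *negative* argument for C++. Even with a *tiny* overhead, C++ **should** be faster, because it avoids a great deal of other overhead that the Python version has. Also, 3. is invalid, because Python *is* creating a new match object for every call to `search`, and C++ should be faster at that too. The really important points are 2 and 5, and I think especially 5. (4 upvotes)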

Asked:	11 years, 9 months ago
Viewed:	1,481 times
Last active:	11 years, 2 months ago