Converting C++ Boost Regexes to Python re regexes

Question


The aim is to convert these regexes in C++ boost to Python re regexes:

  typedef boost::u32regex tRegex;

  tRegex emptyre = boost::make_u32regex("^$");
  tRegex commentre = boost::make_u32regex("^;.*$");
  tRegex versionre = boost::make_u32regex("^@\\$Date: (.*) \\$$");
  tRegex includere = boost::make_u32regex("^<(\\S+)$");
  tRegex rungroupre = boost::make_u32regex("^>(\\d+)$");
  tRegex readreppre = boost::make_u32regex("^>(\\S+)$");
  tRegex tokre = boost::make_u32regex("^:(.*)$");
  tRegex groupstartre = boost::make_u32regex("^#(\\d+)$");
  tRegex groupendre = boost::make_u32regex("^#$");
  tRegex rulere = boost::make_u32regex("^([!-+^])([^\\t]+)\\t+([^\\t]*)$");


I could rewrite these regexes one by one, but there are quite a lot more than in the example above, so my question is about:

how to convert C++ Boost regexes to Python, and
what the difference is between Boost regexes and Python re regexes.

Is the C++ boost::u32regex the same as re regexes in python? If not, what is the difference? (Links to the docs would be much appreciated =) ) For instance:

in boost, there's boost::u32regex_match, is that the same as re.match?
in boost, there's boost::u32regex_search, how is it different from re.search?
there's also boost::format_perl and boost::match_default and boost::smatch, what are their equivalents in Python re?

Answer 1

Wik*_*żew 3

> how to convert C++ boost regexes to Python

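If it is a simple regex, like \w+\s+\d+ or >.*$, you won't have to change the pattern. If the pattern is more complex and uses the constructs mentioned below, you will most probably have to rewrite the regex. As with any conversion from one flavor/language to another, the general answer is "DON'T". However, Python and Boost do have some similarities, especially for simple patterns (if Boost uses a PCRE-like syntax) containing a dot (a.*b), regular ([\w-]*) and negated ([^>]*) character classes, regular quantifiers like +, * and ?, and so on.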
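For the patterns in the question, a direct port might look like the sketch below. The dictionary name PATTERNS and the sample input lines are my own, not from the original code. The only textual change is that the doubled C++ backslashes become single ones inside Python raw strings.

```python
import re

# The question's Boost patterns as Python raw strings.
# C++ string literals double each backslash ("\\S"); raw strings need only one.
PATTERNS = {
    "emptyre": re.compile(r"^$"),
    "commentre": re.compile(r"^;.*$"),
    "versionre": re.compile(r"^@\$Date: (.*) \$$"),
    "includere": re.compile(r"^<(\S+)$"),
    "rungroupre": re.compile(r"^>(\d+)$"),
    "readreppre": re.compile(r"^>(\S+)$"),
    "tokre": re.compile(r"^:(.*)$"),
    "groupstartre": re.compile(r"^#(\d+)$"),
    "groupendre": re.compile(r"^#$"),
    "rulere": re.compile(r"^([!-+^])([^\t]+)\t+([^\t]*)$"),
}

# Made-up sample lines, just to exercise two of the patterns.
print(PATTERNS["versionre"].match("@$Date: 2015-06-01 $").group(1))
print(PATTERNS["rulere"].match("!abc\t\tdef").groups())
```

None of these patterns uses a construct that Python's re lacks, which is why they carry over unchanged.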

> what is the difference between boost regexes and python re regexes?

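Python's re module is not as rich as Boost regex: it lacks constructs such as \h, \G, \K, \R, \X, \Q...\E, branch reset, recursion, possessive quantifiers (before Python 3.11), POSIX character classes and character properties, and extended replacement patterns, among other Boost features. An inline modifier such as (?imsx-imsx:pattern) applies to the whole expression in Python rather than to a part of it, so be aware that the (?i) in your &|&#((?i)x26);|& will be treated as if it were at the start of the pattern (it makes no difference for this particular expression, though). Python 3.6 and later do accept scoped groups such as (?i:x), and since Python 3.11 a bare (?i) anywhere but the start of the pattern is an error.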
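As a rough sketch of the two usual ways to handle an inline (?i) when porting: either lift it into a flag for the whole pattern, or, assuming Python 3.6 or later, scope it to a group. The pattern here is a simplified, hypothetical stand-in, not the one from your code.

```python
import re

# Option 1: case-insensitivity for the whole pattern via a flag argument.
whole = re.compile(r"&#x26;", re.IGNORECASE)

# Option 2 (assumes Python 3.6+): limit the flag to one group with (?i:...).
scoped = re.compile(r"&#(?i:x)26;")

print(bool(whole.fullmatch("&#X26;")))
print(bool(scoped.fullmatch("&#X26;")))
```

Both lines print True here. Option 1 is the safer choice if the code also has to run on older interpreters.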

Also, as in Boost, you do not have to escape [ inside a character class or { outside a character class.

Backreferences like \1 are the same as in Python.

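Since you do not use capturing groups inside alternations in your patterns (e.g. re.sub(r'\d(\w)|(go\w*)', r'\1', 'goon')), there should be no problem (in such cases, Python does not fill the non-participating group with any value and returns an empty result).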
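To make the non-participating group case concrete, here is a small sketch. It assumes Python 3.5 or later, where a replacement reference to a group that did not take part in the match expands to an empty string; older versions raised an error instead.

```python
import re

pattern = re.compile(r"\d(\w)|(go\w*)")

m = pattern.match("goon")
# Only the second alternative matched, so group 1 did not participate.
print(m.group(1))
print(m.group(2))

# Assumes Python 3.5+: the reference to group 1 expands to an empty string.
result = pattern.sub(r"\1", "goon")
print(repr(result))
```

If your ported code ever depends on such a group, check `m.group(n) is None` explicitly rather than relying on the empty-string behaviour.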

Note the difference in named group definitions between Boost and Python: (?<NAME>expression) or (?'NAME'expression) in Boost versus (?P<NAME>expression) in Python.
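If many patterns use named groups, the rewrite can be done mechanically. The helper below is hypothetical (the name to_python_named_groups is mine) and deliberately naive: it is a plain text substitution that does not understand escaped parentheses, so treat it as a starting point only.

```python
import re

# Matches the opening of a Boost/Perl-style named group: (?<NAME> or (?'NAME'
# Lookbehinds such as (?<= and (?<! are not touched, because a group name
# must start with a letter or underscore.
_NAMED_GROUP = re.compile(r"\(\?(?:<([A-Za-z_]\w*)>|'([A-Za-z_]\w*)')")


def to_python_named_groups(boost_pattern):
    """Rewrite (?<NAME>...) and (?'NAME'...) openings as (?P<NAME>...)."""
    def repl(m):
        name = m.group(1) or m.group(2)
        return "(?P<" + name + ">"
    return _NAMED_GROUP.sub(repl, boost_pattern)


converted = to_python_named_groups(r"^#(?<num>\d+)(?<!0)$")
print(converted)
print(re.match(converted, "#15").group("num"))
```

The lookbehind (?<!0) in the sample pattern is left alone, and only the named group is rewritten.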

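I see your regexps mostly fall into the "simple" category. The most complex pattern is a tempered greedy token (e.g. ⌊-((?:(?!-⌋).)*)-⌋). To optimize such patterns, you could use the unroll-the-loop technique, but that may not be necessary, depending on the size of the texts you process with the expressions.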
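For reference, here is one possible unrolled form of that example pattern next to the tempered version. This is a sketch of the technique, not a drop-in replacement: the unrolled form uses [^-], which also matches newlines, while the dot in the tempered form does not. The sample text is made up.

```python
import re

# \u230a is the left floor bracket, \u230b the right floor bracket.
# Tempered greedy token: consume any char as long as "-<right bracket>" does not start here.
tempered = re.compile("\u230a-((?:(?!-\u230b).)*)-\u230b")

# Unrolled: runs of non-hyphens, plus any hyphen that is not
# immediately followed by the closing bracket.
unrolled = re.compile("\u230a-([^-]*(?:-(?!\u230b)[^-]*)*)-\u230b")

text = "x \u230a-a-b-\u230b y"
print(tempered.search(text).group(1))
print(unrolled.search(text).group(1))
```

Both capture the same text on this input. The unrolled version performs the lookahead only at hyphens instead of at every character.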

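The most troublesome part, as I see it, is that you use Unicode literals heavily. In Python 2.x, plain string literals are byte strings, so you always have to make sure you pass unicode objects to Unicode regexps (see Python 2.x's Unicode Support). In Python 3, every str is a Unicode string and source files are read as UTF-8 by default, so you can use non-ASCII literal characters in source code without any extra steps (see Python's Unicode Support). So Python 3.3+ is a good candidate, and raw string literals such as r'\d+' keep the patterns readable there.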
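A quick way to see the Python 3 behaviour: with str patterns, \w is Unicode-aware unless you ask for ASCII-only matching. The sample text is my own and is written with escapes, so the snippet does not depend on the file encoding.

```python
import re

text = "na\u00efve caf\u00e9"  # the words "naive cafe" with i-diaeresis and e-acute

# str patterns are Unicode-aware by default in Python 3.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(ascii_words)
```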

Now, as for the remaining questions:

> in boost, there's boost::u32regex_match, is that the same as re.match?

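re.match is not the same as regex_match: re.match looks for a match at the beginning of the string, whereas regex_match requires the full string to match. However, in Python 3 (3.4 and later) you can use re.fullmatch(pattern, string, flags=0), which is equivalent to Boost's regex_match.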
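The difference only shows up when the pattern is not anchored at the end, so this sketch drops the ^ and $ from one of your patterns. The input line is made up, and re.fullmatch assumes Python 3.4 or later.

```python
import re

# No ^ or $ anchors here, to make the difference visible.
pattern = re.compile(r">(\d+)")

line = ">42 trailing"
print(pattern.match(line))      # a match object covering ">42"
print(pattern.fullmatch(line))  # None: the whole string must match

print(pattern.fullmatch(">42").group(1))
```

With your fully anchored ^...$ patterns the two behave almost the same, except that $ in Python also matches just before a trailing newline, so re.match(r"^#$", "#\n") succeeds while re.fullmatch does not.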

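> in boost, there's boost::u32regex_search, how is it different to re.search

Whenever you need to find a match anywhere inside a string, you need to use re.search (see match() versus search()). So this method provides functionality similar to that of regex_search in Boost.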
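A minimal illustration, with an unanchored pattern and a made-up input string:

```python
import re

pattern = re.compile(r"#(\d+)")
text = "group #17 starts here"

print(pattern.match(text))            # None: the string does not start with "#"
print(pattern.search(text).group(1))  # finds the first match anywhere in the string
```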
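
> there's also boost::format_perl and boost::match_default and boost::smatch, what are their equivalence in python re?

Python does not support Perl-like expressions to the extent Boost does; the Python re module is just a "trimmed" Perl regex engine that lacks many of the nice features I mentioned earlier. So there are no flags like default or perl to be found there. As for smatch, you can use re.finditer to get all the match objects. re.findall returns all matches (or only the submatches, if capturing groups are specified) as a list of strings or a list of tuples. See the difference between re.findall and re.finditer.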
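Roughly speaking, each object yielded by re.finditer plays the role that a filled-in smatch plays in Boost: it carries the whole match, the groups and their positions. A short sketch with made-up input:

```python
import re

pattern = re.compile(r"#(\d+)")
text = "#1 then #22 then #333"

# finditer yields one match object per match.
for m in pattern.finditer(text):
    print(m.group(0), m.group(1), m.span())

# findall returns only the captured group when the pattern has exactly one group...
captured = pattern.findall(text)
print(captured)

# ...and the whole match when there are no groups.
whole = re.findall(r"#\d+", text)
print(whole)
```

With two or more capturing groups, findall returns a list of tuples instead.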

Finally, here is a must-read article: Python's re Module.
