How can I write more maintainable regular expressions?

Question

ojb*_*ass 41 regex maintenance readability

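I have started to feel that using regular expressions decreases code maintainability. There is something evil about the terseness and power of regular expressions. Perl compounds this with side effects such as default operators.

I do have a habit of documenting regular expressions with at least one sentence giving the basic intent and at least one example of what would match.

Because regular expressions are built up piece by piece, I feel it is absolutely necessary to comment on the largest components of each element in the expression. Despite this, even my own regular expressions leave me scratching my head as though I were reading Klingon.

Do you intentionally dumb down your regular expressions? Do you decompose possibly shorter and more powerful ones into simpler steps? I have given up on nesting regular expressions. Are there regular expression constructs that you avoid because of maintainability issues?

Do not let this example cloud the question.

If the following pattern by Michael Ash had a bug in it, would you have any prospect of doing anything other than throwing it away entirely?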

^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

By request, the exact purpose can be found by following the link to Mr. Ash's pattern above.

Matches: 01.1.02 | 11-30-2001 | 2/29/2000

Non-matches: 02/29/01 | 13/01/2002 | 11/00/02
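
The habit described above, recording at least one matching example next to the pattern, can be made executable. The following is only an illustrative sketch in Python and is not part of the original question: it checks the quoted pattern against the examples listed above. The names DATE_RE, SHOULD_MATCH, SHOULD_NOT_MATCH and check_examples are invented for this sketch.

```python
import re

# Michael Ash's pattern, copied from the question and split at its
# three top-level alternatives only so that the lines stay readable.
DATE_RE = re.compile(
    r"^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
    r"|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$"
    r"|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
)

# The documented examples from the question, kept next to the pattern.
SHOULD_MATCH = ["01.1.02", "11-30-2001", "2/29/2000"]
SHOULD_NOT_MATCH = ["02/29/01", "13/01/2002", "11/00/02"]


def check_examples():
    """Return the examples whose result differs from the documented one."""
    wrong = [s for s in SHOULD_MATCH if not DATE_RE.search(s)]
    wrong += [s for s in SHOULD_NOT_MATCH if DATE_RE.search(s)]
    return wrong
```

If check_examples() ever returns a non-empty list after the pattern is edited, the documentation and the pattern no longer agree.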

Answer 1

Mit*_*eat 32

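Use Expresso, which gives a hierarchical, English breakdown of a regex.

Or

This tip comes from Darren Neimke:

.NET allows regular expression patterns to be authored with embedded comments via the RegexOptions.IgnorePatternWhitespace option and the (?#...) syntax embedded within each line of the pattern string.

This allows pseudo-code-like comments to be embedded in each line and has the following effect on readability: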

Dim re As New Regex ( _
    "(?<=       (?# Start a positive lookBEHIND assertion ) " & _
    "(\#|@)     (?# Find a # or a @ symbol; the # is escaped because a bare # starts a comment in this mode ) " & _
    ")          (?# End the lookBEHIND assertion ) " & _
    "(?=        (?# Start a positive lookAHEAD assertion ) " & _
    "   \w+     (?# Find at least one word character ) " & _
    ")          (?# End the lookAHEAD assertion ) " & _
    "\w+\b      (?# Match multiple word characters leading up to a word boundary)", _
    RegexOptions.Multiline Or RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace _
)

Here is another .NET example (it requires the RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace options):

static string validEmail = @"\b    # Find a word boundary
                (?<Username>       # Begin group: Username
                [a-zA-Z0-9._%+-]+  #   Characters allowed in username, 1 or more
                )                  # End group: Username
                @                  # The e-mail '@' character
                (?<Domainname>     # Begin group: Domain name
                [a-zA-Z0-9.-]+     #   Domain name(s), we include a dot so that
                                   #   mail.somewhere is also possible
                \.[a-zA-Z]{2,4}    #   A literal dot, then a top level domain of 2 to 4 letters,
                                   #   so .info works, .telephone doesn't.
                )                  # End group: Domain name
                \b                 # Ending on a word boundary
                ";

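
For comparison, the same commenting style is available outside .NET. The sketch below uses Python's re.VERBOSE flag, which plays a role similar to IgnorePatternWhitespace. It is a simplified illustration rather than a complete e-mail validator, and the names EMAIL_RE and split_email are invented for this example.

```python
import re

EMAIL_RE = re.compile(
    r"""
    \b                            # start on a word boundary
    (?P<username>
        [a-zA-Z0-9._%+-]+         # characters allowed in the user name, 1 or more
    )
    @                             # the literal '@'
    (?P<domain>
        [a-zA-Z0-9.-]+            # domain labels; dots allowed so mail.somewhere works
        \.[a-zA-Z]{2,4}           # a literal dot, then a top-level domain of 2 to 4 letters
    )
    \b                            # end on a word boundary
    """,
    re.VERBOSE,
)


def split_email(text):
    """Return (username, domain) for the first address in text, or None."""
    m = EMAIL_RE.search(text)
    return (m.group("username"), m.group("domain")) if m else None
```

Named groups carry part of the documentation load: a caller reads m.group("domain") instead of counting parentheses.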

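If your RegEx applies to a common problem, another option is to document it and submit it to RegExLib, where it will be rated and commented upon. Nothing beats many pairs of eyes...

Another RegEx tool is The Regulator.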

Answer 2

Jam*_*mes 19

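I usually just try to wrap every regular expression call inside its own function, with a meaningful name and some basic comments. I like to think of regular expressions as a write-only language, readable only by the person who wrote it (unless it is really simple). I fully expect that someone would need to rewrite the expression completely if they had to change its intent, and that is probably for the better, since it keeps the regular expression skills in practice.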
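
A rough sketch of this approach, in Python: the pattern stays private to one small function whose name and docstring state the intent. The function name and the ticket-ID format are invented purely for illustration.

```python
import re

# Hypothetical format for this sketch: two to four capital letters,
# a dash, then one or more digits (for example ABC-123).
_TICKET_ID_RE = re.compile(r"\b[A-Z]{2,4}-\d+\b")


def find_ticket_ids(text):
    """Return every ticket ID such as 'ABC-123' found in text, in order."""
    return _TICKET_ID_RE.findall(text)
```

Callers only ever see find_ticket_ids, so the pattern can be rewritten from scratch without touching any call site.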

Answer 3

cha*_*aos 17

Well, the PCRE /x modifier's entire purpose in life is to let you write regexes more readably, as in this trivial example:

my $expr = qr/
    [a-z]    # match a lower-case letter
    \d{3,5}  # followed by 3-5 digits
/x;

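In Perl, this would be the recommended way of dealing with the readability problem. In particular, the /x modifier tells Perl to ignore unescaped whitespace outside character classes (the regex must use \s or an escaped space to match whitespace in the search) and lets the '#' character behave like a normal comment marker. (3 upvotes)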
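
The whitespace rule can be seen in a short sketch. It uses Python's re.VERBOSE flag as a stand-in for /x, which is an assumption made for illustration since the answer itself is about Perl; the names LOOSE and STRICT are invented here.

```python
import re

# In verbose mode the space between the two parts is ignored,
# so this pattern matches "a123" and not "a 123".
LOOSE = re.compile(r"[a-z] \d{3,5}", re.VERBOSE)

# A space that must really be matched is written inside a character
# class (or as \s when any whitespace character is acceptable).
STRICT = re.compile(r"[a-z] [ ] \d{3,5}", re.VERBOSE)
```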

Answer 4

pax*_*blo 8

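Some people use REs for the wrong things (I am waiting for the first SO question on how to detect a valid C++ program using a single RE).

I usually find that, if I cannot fit my RE within 60 characters, it is better off being a piece of code, since that will almost always be more readable.

In any case, I always document in the code, in great detail, what the RE is supposed to achieve. This is because I know, from bitter experience, how hard it is for someone else (or even me, six months later) to come in and try to understand it.

I do not believe they are evil, although I do believe some people who use them are evil (not looking at you, Michael Ash :-). They are a great tool but, like a chainsaw, you will cut your legs off if you do not know how to use them properly.

UPDATE: Actually, I have just followed the link to that monstrosity, and it is meant to validate m/d/y format dates between the years 1600 and 9999. That is a classic case where full-blown code would be more readable and maintainable.

You just split it into three fields and check the individual values. If one of my minions brought this to me, I would almost consider it an offence worthy of termination. I would certainly send them back to write it properly.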
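
A minimal sketch of that "split and check" idea, written in Python for illustration. The function names are invented, and treating two-digit years as 19xx is an assumption of this sketch, not something the answer specifies.

```python
import re

# Last day of each month, allowing 29 for February; leap years are checked separately.
DAYS_IN_MONTH = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_valid_mdy(text):
    """Check an m/d/y date that uses '/', '-' or '.' consistently as separator."""
    # One short regex only splits the string into three numeric fields.
    m = re.fullmatch(r"([0-9]{1,2})([/.-])([0-9]{1,2})\2([0-9]{2}|[0-9]{4})", text)
    if m is None:
        return False
    month, day, year_text = int(m.group(1)), int(m.group(3)), m.group(4)
    year = int(year_text)
    if len(year_text) == 2:
        year += 1900  # assumption: two-digit years are read as 19xx
    elif year < 1600:
        return False  # four-digit years must fall in 1600..9999
    if not 1 <= month <= 12:
        return False
    if not 1 <= day <= DAYS_IN_MONTH[month - 1]:
        return False
    if month == 2 and day == 29 and not is_leap_year(year):
        return False
    return True
```

Each rule sits on its own line, so changing the accepted year range or the separators is a one-line edit.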

Answer 5

Cha*_*ens 5

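Here is the same regex broken down into digestible pieces. In addition to being more readable, some of the sub-regexes could be useful on their own. It is also significantly easier to change the allowed separators.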

#!/usr/local/ActivePerl-5.10/bin/perl

use 5.010; #only 5.10 and above
use strict;
use warnings;

my $sep         = qr{ [/.-] }x;               #allowed separators    
my $any_century = qr/ 1[6-9] | [2-9][0-9] /x; #match the century 
my $any_decade  = qr/ [0-9]{2} /x;            #match any decade or 2 digit year
my $any_year    = qr/ $any_century? $any_decade /x; #match a 2 or 4 digit year

#match the 1st through 28th for any month of any year
my $start_of_month = qr/
    (?:                         #match
        0?[1-9] |               #Jan - Sep or
        1[0-2]                  #Oct - Dec
    )
    ($sep)                      #the separator
    (?: 
        0?[1-9] |               # 1st -  9th or
        1[0-9]  |               #10th - 19th or
        2[0-8]                  #20th - 28th
    )
    \g{-1}                      #and the separator again
/x;

#match 28th - 31st for any month but Feb for any year
my $end_of_month = qr/
    (?:
        (?: 0?[13578] | 1[02] ) #match Jan, Mar, May, Jul, Aug, Oct, Dec
        ($sep)                  #the separator
        31                      #the 31st
        \g{-1}                  #and the separator again
        |                       #or
        (?: 0?[13-9] | 1[0-2] ) #match all months but Feb
        ($sep)                  #the separator
        (?:29|30)               #the 29th or the 30th
        \g{-1}                  #and the separator again
    )
/x;

#match any non-leap year date and the first part of Feb in leap years
my $non_leap_year = qr/ (?: $start_of_month | $end_of_month ) $any_year/x;

#match 29th of Feb in leap years
#BUG: 00 is treated as a non leap year
#even though 2000, 2400, etc are leap years
my $feb_in_leap = qr/
    0?2                         #match Feb
    ($sep)                      #the separator
    29                          #the 29th
    \g{-1}                      #the separator again
    (?:
        $any_century?           #any century
        (?:                     #and decades divisible by 4 but not 100
            0[48]       | 
            [2468][048] |
            [13579][26]
        )
        |
        (?:                     #or match centuries that are divisible by 4
            16          | 
            [2468][048] |
            [3579][26]
        )
        00                      
    )
/x;

my $any_date  = qr/$non_leap_year|$feb_in_leap/;
my $only_date = qr/^$any_date$/;

say "test against garbage";
for my $date (qw(022900 foo 1/1/1)) {
    say "\t$date ", $date =~ $only_date ? "matched" : "didn't match";
}
say '';

#comprehensive test

my @code = qw/good unmatch month day year leap/;
for my $sep (qw( / - . )) {
    say "testing $sep";
    my $i  = 0;
    for my $y ("00" .. "99", 1600 .. 9999) {
        say "\t", int $i/8500*100, "% done" if $i++ and not $i % 850;
        for my $m ("00" .. "09", 0 .. 13) {
            for my $d ("00" .. "09", 1 .. 31) {
                my $date = join $sep, $m, $d, $y;
                my $re   = $date =~ $only_date ? 1 : 0;
                my $code = not_valid($date);
                unless ($re == !$code) {
                    die "error $date re $re code $code[$code]\n"
                }
            }
        }
    }
}

sub not_valid {
    state $end = [undef, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
    my $date      = shift;
    my ($m,$d,$y) = $date =~ m{([0-9]+)[-./]([0-9]+)[-./]([0-9]+)};
    return 1 unless defined $m; #if $m is set, the rest will be too
    #components are in roughly the right ranges
    return 2 unless $m >= 1 and $m <= 12;
    return 3 unless $d >= 1 and $d <= $end->[$m];
    return 4 unless ($y >= 0 and $y <= 99) or ($y >= 1600 and $y <= 9999);
    #handle the non leap year case
    return 5 if $m == 2 and $d == 29 and not leap_year($y);

    return 0;
}

sub leap_year {
    my $y    = shift;
    $y = "19$y" if $y < 1600;
    return 1 if 0 == $y % 4 and 0 != $y % 100 or 0 == $y % 400;
    return 0;
}

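
The same decomposition idea carries over to other regex dialects. The sketch below is a Python illustration, not a port of the script above: it covers only the "1st through 28th of any month" piece and takes the allowed separators as a parameter. The function name build_early_date_re is invented, and a named group stands in for Perl's relative \g{-1} backreference.

```python
import re


def build_early_date_re(separators="/.-"):
    """Build a pattern for month, day 1-28 and a 2- or 4-digit year."""
    sep = "[" + re.escape(separators) + "]"           # allowed separators
    any_century = r"(?: 1[6-9] | [2-9][0-9] )"        # 16 through 99
    any_year = rf"(?: {any_century}? [0-9]{{2}} )"    # 2- or 4-digit year
    month = r"(?: 0?[1-9] | 1[0-2] )"                 # Jan through Dec
    day_1_to_28 = r"(?: 0?[1-9] | 1[0-9] | 2[0-8] )"  # 1st through 28th
    return re.compile(
        rf"""
        ^ {month}
        (?P<sep> {sep} )      # the separator
        {day_1_to_28}
        (?P=sep)              # the same separator again
        {any_year} $
        """,
        re.VERBOSE,
    )
```

Calling build_early_date_re(":") yields a variant that accepts only colons, without touching any other piece of the pattern.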
