英文数字的可扩展正则表达式

Joã*_*lva 5 java regex perl

我试图创建一个正则表达式来识别 英文数字,如一个,19,20,122,等等,一直到上百万.我想重用正则表达式的一些部分,因此正则表达式是由部分构造的,如下所示:

// replace <TAG> with the content of the variable
ONE_DIGIT = (?:one|two|three|four|five|six|seven|eight|nine)
TEEN = (?:ten|eleven|twelve|(?:thir|for|fif|six|seven|eigh|nine)teen)
TWO_DIGITS = (?:(?:twen|thir|for|fif|six|seven|eigh|nine)ty(?:\s+<ONE_DIGIT>)?|<TEEN>)
// HUNDREDS, et cetera
Run Code Online (Sandbox Code Playgroud)

我想知道是否有人已经做过同样的事情(并希望分享),因为这些正则表达式很长,而且他们可能有他们不应该做的事情,或者我可能会遗漏的事情.此外,我希望它们尽可能高效,所以我期待任何优化提示.我正在使用Java正则表达式引擎,但任何正则表达式都是可以接受的.

Sin*_*nür 8

参见Perl的Lingua :: EN :: Words2NumsLingua :: EN :: FindNumber.

特别是,源代码Lingua::EN::FindNumber包含:

# This is from Lingua::EN::Words2Nums, after being thrown through
# Regex::PreSuf
my $numbers =
    qr/((?:b(?:akers?dozen|illi(?:ard|on))|centillion|d(?:ecilli(?:ard|on)|ozen|u(?:o(?:decilli(?:ard|on)|vigintillion)|vigintillion))|e(?:ight(?:een|ieth|[yh])?|leven(?:ty(?:first|one))?|s)|f(?:i(?:ft(?:een|ieth|[yh])|rst|ve)|o(?:rt(?:ieth|y)|ur(?:t(?:ieth|[yh]))?))|g(?:oogol(?:plex)?|ross)|hundred|mi(?:l(?:ion|li(?:ard|on))|nus)|n(?:aught|egative|in(?:et(?:ieth|y)|t(?:een|[yh])|e)|o(?:nilli(?:ard|on)|ught|vem(?:dec|vigint)illion))|o(?:ct(?:illi(?:ard|on)|o(?:dec|vigint)illion)|ne)|qu(?:a(?:drilli(?:ard|on)|ttuor(?:decilli(?:ard|on)|vigintillion))|in(?:decilli(?:ard|on)|tilli(?:ard|on)|vigintillion))|s(?:core|e(?:cond|pt(?:en(?:dec|vigint)illion|illi(?:ard|on))|ven(?:t(?:ieth|y))?|x(?:decillion|tilli(?:ard|on)|vigintillion))|ix(?:t(?:ieth|y))?)|t(?:ee?n|h(?:ir(?:t(?:een|ieth|y)|d)|ousand|ree)|r(?:e(?:decilli(?:ard|on)|vigintillion)|i(?:gintillion|lli(?:ard|on)))|w(?:e(?:l(?:fth|ve)|nt(?:ieth|y))|o)|h)|un(?:decilli(?:ard|on)|vigintillion)|vigintillion|zero|s))/i;
Run Code Online (Sandbox Code Playgroud)

Perl的艺术许可证限制.

您可以使用Regex :: PreSuf自动分解常见的前缀和后缀:

#!/usr/bin/perl

use strict;
use warnings;

use Regex::PreSuf;

my %singledigit = (
    one    => 1,
    two    => 2,
    three  => 3,
    four   => 4,
    five   => 5,
    six    => 6,
    seven  => 7,
    eight  => 8,
    nine   => 9,
);

my $singledigit = presuf(keys %singledigit);

print $singledigit, "\n";

my $text = "one two three four five six seven eight nine";

$text =~ s/($singledigit)/$singledigit{$1}/g;

print $text, "\n";
Run Code Online (Sandbox Code Playgroud)

输出:

C:\Temp> cvb
(?:eight|f(?:ive|our)|nine|one|s(?:even|ix)|t(?:hree|wo))
1 2 3 4 5 6 7 8 9

我担心这会变得更难;-)