Edw*_*rdr 7 regex parsing text nlp text-extraction
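I'll just put it out there: I'm terrible with regular expressions. I tried to come up with one to solve my problem, but I really don't know much about them...
Imagine some sentences along the following lines:
- Hello blah blah. It's around 11 1/2" x 32".
- The dimensions are 8 x 10-3/5!
- Probably somewhere in the region of 22" x 17".
- The roll is quite large: 42 1/2" x 60 yd.
- They are all 5.76 by 8 frames.
- Yeah, maybe it's around 84cm long.
- I think about 13/19".
- No, it's probably 86 cm actually.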
 
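I want to extract the item dimensions from these sentences as cleanly as possible. In a perfect world, the regular expression would output the following:
- 11 1/2" x 32"
- 8 x 10-3/5
- 22" x 17"
- 42 1/2" x 60 yd
- 5.76 by 8
- 84cm
- 13/19"
- 86 cm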
 
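I imagine a world where the following rules apply:
- The valid units are {cm, mm, yd, yards, ", ', feet}, though I'd prefer a solution that handles an arbitrary set of units rather than one written explicitly for the units above.
- A dimension may consist of a fraction on its own, e.g. 4/5".
- A / separates the numerator from the denominator, and one can assume there is no space between the parts (though if someone takes that into account, great!).
- Two dimensions are separated by one of {x, by}.
- If a dimension is only one-dimensional, it must have a unit from the set above, i.e. 22 cm is OK; .333 is not, and neither is 4.33 oz.

To show you how useless I am with regular expressions (and to show I at least tried!), I got this far...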
[1-9]+[/ ][x1-9]
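UPDATE (2)
You guys are very fast and efficient! I'm going to add a few extra test cases that aren't covered by the regular expressions below:
- The last but one test case is 12 yd x.
- The last test case is 99 cm by.
- This sentence doesn't have dimensions in it: 342 / 5553 / 222.
- Three dimensions? 22" x 17" x 12 cm
- This is a product code: c720 with another number 83 x better.
- A number on its own 21.
- A volume shouldn't match 0.332 oz.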
 
These should result in the following (# indicates that nothing should match):
- 12 yd
- 99 cm
- #
- 22" x 17" x 12 cm
- #
- #
- #
 
I've adapted M42's answer below to:
\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)(?:\s*x\s*|\s*by\s*)?(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)?
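But although that resolves some of the new test cases, it now fails to match some of the others. It reports:
- 11 1/2" x 32" PASS
- (nothing) FAIL
- 22" x 17" PASS
- 42 1/2" x 60 yd PASS
- (nothing) FAIL
- 84cm PASS
- 13/19" PASS
- 86 cm PASS
- 22" PASS
- (nothing) FAIL
- (nothing) FAIL
- 12 yd x FAIL
- 99 cm by FAIL
- 22" x 17" [and also, but separately, '12 cm'] FAIL
- PASS
- PASS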
New version, close to the target: 2 tests still fail.
#!/usr/local/bin/perl 
use Modern::Perl;
use Test::More;
my $re1 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)/;
my $re2 = qr/(?:\s*x\s*|\s*by\s*)/;
my $re3 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet|frames)/;
my @out = (
'11 1/2" x 32"',
'8 x 10-3/5',
'22" x 17"',
'42 1/2" x 60 yd',
'5.76 by 8 frames',
'84cm',
'13/19"',
'86 cm',
'12 yd',
'99 cm',
'no match',
'22" x 17" x 12 cm',
'no match',
'no match',
'no match',
);
my $i = 0;
my $xx = '22" x 17"';
while(<DATA>) {
    chomp;
    if (/($re1(?:$re2$re3)?(?:$re2$re1)?)/) {
        ok($1 eq $out[$i], $1 . ' in ' . $_);
    } else {
        ok($out[$i] eq 'no match', ' got "no match" in '.$_);
    }
    $i++;
}
done_testing;
__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
The last but one test case is 12 yd x.
The last test case is 99 cm by.
This sentence doesn't have dimensions in it: 342 / 5553 / 222.
Three dimensions? 22" x 17" x 12 cm
This is a product code: c720 with another number 83 x better.  
A number on its own 21.
A volume shouldn't match 0.332 oz.
Output:
#   Failed test ' got "no match" in The dimensions are 8 x 10-3/5!'
#   at C:\tests\perl\test6.pl line 42.
#   Failed test ' got "no match" in They are all 5.76 by 8 frames.'
#   at C:\tests\perl\test6.pl line 42.
# Looks like you failed 2 tests of 15.
ok 1 - 11 1/2" x 32" in Hello blah blah. It's around 11 1/2" x 32".
not ok 2 -  got "no match" in The dimensions are 8 x 10-3/5!
ok 3 - 22" x 17" in Probably somewhere in the region of 22" x 17".
ok 4 - 42 1/2" x 60 yd in The roll is quite large: 42 1/2" x 60 yd.
not ok 5 -  got "no match" in They are all 5.76 by 8 frames.
ok 6 - 84cm in Yeah, maybe it's around 84cm long.
ok 7 - 13/19" in I think about 13/19".
ok 8 - 86 cm in No, it's probably 86 cm actually.
ok 9 - 12 yd in The last but one test case is 12 yd x.
ok 10 - 99 cm in The last test case is 99 cm by.
ok 11 -  got "no match" in This sentence doesn't have dimensions in it: 342 / 5553 / 222.
ok 12 - 22" x 17" x 12 cm in Three dimensions? 22" x 17" x 12 cm
ok 13 -  got "no match" in This is a product code: c720 with another number 83 x better.  
ok 14 -  got "no match" in A number on its own 21.
ok 15 -  got "no match" in A volume shouldn't match 0.332 oz.
1..15
It seems difficult to match 5.76 by 8 frames but not 0.332 oz: sometimes you have to match a number followed by a unit, and sometimes a number without one.
Sorry, I'm not able to do better.
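One possible way around that conflict, as a minimal sketch and not a tested replacement for the script above: allow a number without a unit only when it is part of an `x`/`by` chain, and require a unit when a dimension stands alone. The names `@units`, `$dimension_re` and `extract_dimensions` are hypothetical, and the unit list is the one from the question, so `frames` is not treated as a unit and the fifth sentence yields `5.76 by 8`, as the question asks.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumed unit list (taken from the question); replace with any set of units.
my @units = ('cm', 'mm', 'yd', 'yards', '"', "'", 'feet');

# Build the unit alternation: longest unit first, metacharacters escaped,
# and a \b after word-like units so "cm" is not matched inside "cmx".
my $unit_alt = join '|',
    map  { my $u = quotemeta; $u .= '\b' if /\w\z/; $u }
    sort { length($b) <=> length($a) } @units;

my $unit_re = qr/(?:$unit_alt)/;
# A number: a bare fraction (13/19), or an integer/decimal optionally
# followed by a fraction (11 1/2, 10-3/5).
my $num_re  = qr/(?:\d+\/\d+|\d+(?:\.\d+)?(?:[ -]\d+\/\d+)?)/;
my $dim_re  = qr/$num_re(?:\s*$unit_re)?/;   # one dimension, unit optional
my $sep_re  = qr/\s*(?:x|by)\s*/;            # separator between dimensions

my $dimension_re = qr/
    (?<![\w.\/])                       # do not start inside a word or number
    (?: $dim_re (?: $sep_re $dim_re )+ # two or more dimensions, units optional
      | $num_re \s* $unit_re           # or a single dimension with a unit
    )
/x;

# Return the first dimension found in $text, or undef if there is none.
sub extract_dimensions {
    my ($text) = @_;
    return $text =~ /($dimension_re)/ ? $1 : undef;
}

for my $sentence (
    'The dimensions are 8 x 10-3/5!',
    'They are all 5.76 by 8 frames.',
    'The last but one test case is 12 yd x.',
    q{A volume shouldn't match 0.332 oz.},
) {
    my $hit = extract_dimensions($sentence);
    printf "%-12s <- %s\n", defined $hit ? $hit : '#', $sentence;
}
```

Because the chained alternative is tried first and needs at least one separator followed by another number, a dangling `x` or `by` (as in `12 yd x.`) should fall back to the single-dimension alternative, and a lone number such as `21` or `0.332 oz` should match nothing. The lookbehind is an extra assumption that keeps the match from starting inside tokens like `c720` or after a decimal point.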