Different Perls handle the vertical tab differently

Ste*_*hen 2 perl

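I have two Perl programs that use the same library to process documents. They are installed on two different servers, one running Perl 5.12 and the other running Perl 5.18.

Right now I am feeding the same files to both so that I can diff the output and make sure it matches. So far I have had hundreds of identical results. The programs usually process UTF-8 files, and I have taken care to handle that encoding correctly.

Today they both received a binary file, and for the first time I saw a difference. I determined that one program (the one running Perl 5.18) strips vertical tabs from the file content before writing it out, while the other does not.

I could write this off as binary files being unsupported, but it still bothers me that the two behave differently. I looked at the library that does the processing, and it contains this line (it is applied to every line of the file):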

$line =~ s/\s//g;

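Is it possible that one of the Perls treats the vertical tab as whitespace and the other does not? How can I check? Is there anything else you think I should look into?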

ike*_*ami 7

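As of 5.18, the vertical tab is considered whitespace.

No one could recall why \s didn't match \cK, the vertical tab. Now it does. Given the extreme rarity of that character, very little breakage is expected. That said, here's what it means:

\s in a regex now matches a vertical tab in all circumstances.

Literal vertical tabs in a regex literal are ignored when the /x modifier is used.

Leading vertical tabs, alone or mixed with other whitespace, are now ignored when interpreting a string as numeric. For example: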

$dec = " \cK \t 123";
$hex = " \cK \t 0xF";
say 0 + $dec;   # was 0 with warning, now 123
say int $dec;   # was 0, now 123
say oct $hex;   # was 0, now  15

This brings Perl in line with Unicode, which classifies U+000B LINE TABULATION (aka VERTICAL TABULATION, aka VT) as a White_Space character.
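To answer the "how can I check?" part, a minimal sketch along these lines could be run on each server. The variable name `$vt_is_space` is just illustrative; the script only asks the running perl whether `\s` matches U+000B and reports its version.

```perl
use strict;
use warnings;

# "\x0B" is the vertical tab (VT), also written \cK.
my $vt_is_space = "\x0B" =~ /\s/ ? 1 : 0;

# $^V with %vd prints the interpreter version, e.g. 5.18.2.
printf "Perl %vd: \\s %s the vertical tab\n",
    $^V, $vt_is_space ? "matches" : "does not match";
```

On a perl older than 5.18 this should report "does not match", and on 5.18 or later "matches".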


You can get the old behaviour back by replacing \s with [^\S\x0B].
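As a sketch of what that replacement looks like when applied to the line from the question (the helper name `strip_space_keep_vt` is hypothetical and not part of the library being discussed):

```perl
use strict;
use warnings;

# Hypothetical helper: strips whitespace the way pre-5.18 \s did,
# i.e. everything \s matches except the vertical tab.
sub strip_space_keep_vt {
    my ($line) = @_;
    # [^\S\x0B] reads "not non-whitespace and not VT".
    $line =~ s/[^\S\x0B]//g;
    return $line;
}

my $out = strip_space_keep_vt("a \t\x0Bb\n");

# Show the surviving characters as hex code points.
print join(' ', map { sprintf '%02X', ord } split //, $out), "\n";   # 61 0B 62
```

The space, tab and newline are removed while the vertical tab (0B) survives, and this should hold on both 5.12 and 5.18.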

Also worth considering is \h, which matches only horizontal whitespace characters.

U+0009 CHARACTER TABULATION        Matched by \s & \h
U+000A LINE FEED                   Matched by \s & \v
U+000B LINE TABULATION             Matched by \s & \v
U+000C FORM FEED                   Matched by \s & \v
U+000D CARRIAGE RETURN             Matched by \s & \v
U+0020 SPACE                       Matched by \s & \h
U+0085 NEXT LINE                   Matched by \s & \v
U+00A0 NO-BREAK SPACE              Matched by \s & \h
U+1680 OGHAM SPACE MARK            Matched by \s & \h
U+2000 EN QUAD                     Matched by \s & \h
U+2001 EM QUAD                     Matched by \s & \h
U+2002 EN SPACE                    Matched by \s & \h
U+2003 EM SPACE                    Matched by \s & \h
U+2004 THREE-PER-EM SPACE          Matched by \s & \h
U+2005 FOUR-PER-EM SPACE           Matched by \s & \h
U+2006 SIX-PER-EM SPACE            Matched by \s & \h
U+2007 FIGURE SPACE                Matched by \s & \h
U+2008 PUNCTUATION SPACE           Matched by \s & \h
U+2009 THIN SPACE                  Matched by \s & \h
U+200A HAIR SPACE                  Matched by \s & \h
U+2028 LINE SEPARATOR              Matched by \s & \v
U+2029 PARAGRAPH SEPARATOR         Matched by \s & \v
U+202F NARROW NO-BREAK SPACE       Matched by \s & \h
U+205F MEDIUM MATHEMATICAL SPACE   Matched by \s & \h
U+3000 IDEOGRAPHIC SPACE           Matched by \s & \h
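A few rows of the table can be spot-checked with a short script such as the following. It is only a sketch: the `%name` and `%class` hashes are illustrative, and it assumes a perl that supports \h and \v (5.10 or later).

```perl
use strict;
use warnings;

# A handful of code points from the table above.
my %name = (0x09 => 'TAB', 0x0A => 'LF', 0x0B => 'VT', 0x20 => 'SPACE');

my %class;
for my $cp (sort { $a <=> $b } keys %name) {
    my $ch = chr $cp;
    # Classify each character as horizontal, vertical, or neither.
    $class{$cp} = $ch =~ /\h/ ? '\h'
                : $ch =~ /\v/ ? '\v'
                :               'neither';
    printf "U+%04X %-5s %s\n", $cp, $name{$cp}, $class{$cp};
}
```

Unlike \s, the \h and \v classes did not change in 5.18, so they are one way to get whitespace matching that behaves the same on both servers.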