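I have two Perl programs that use the same library to process documents. They are installed on two different servers, one running Perl 5.12 and the other running Perl 5.18.
For now I am feeding the same files to both so that I can diff the outputs and make sure they match. I have had hundreds of identical results. They normally process UTF-8 files, and I have taken care to handle that encoding correctly.
Today they both received a binary file, and that was the first time I saw a difference. I determined that one program (the one running Perl 5.18) strips vertical tabs from the file content before output, while the other does not.
I could write this off as binary files not being supported, but it still bothers me that the two behave differently. I looked at the library that does the processing, and it contains this line (it processes every line of the file this way):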
$line =~ s/\s//g;
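Is it possible that one of the Perls considers a vertical tab to be whitespace and the other does not? How can I check? Is there anything else you think I should look into?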
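Starting with 5.18, the vertical tab is considered whitespace. The release notes for 5.18 say:
No one could recall why \s didn't match \cK, the vertical tab. Now it does. Given the extreme rarity of that character, very little breakage is expected. That said, here's what it means:
\s in a regex now matches a vertical tab in all circumstances.
Literal vertical tabs in a regex literal are ignored when the /x modifier is used.
Leading vertical tabs, alone or mixed with other whitespace, are now ignored when interpreting a string as a number. For example: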
$dec = " \cK \t 123";
$hex = " \cK \t 0xF";
say 0 + $dec;   # was 0 with warning, now 123
say int $dec;   # was 0, now 123
say oct $hex;   # was 0, now 15
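This brings Perl in line with Unicode, which lists U+000B LINE TABULATION (also known as VERTICAL TABULATION, or VT) as a White_Space character.
You can get the old behavior back by replacing \s with [^\S\x0B].
Also worth considering is \h, which matches only horizontal whitespace characters. The list below shows which of \h and \v matches each whitespace character: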
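To answer the "how can I check?" part, a small script along these lines could be run on both servers. This is only a sketch: `strip_ws_legacy` is a hypothetical helper name, not something from the library in question, and the sample string is made up for illustration.

```perl
use strict;
use warnings;

# Report whether this perl's \s matches a vertical tab (U+000B).
my $vt = "\x0B";
printf "perl %vd: \\s %s a vertical tab\n",
    $^V, ($vt =~ /\s/ ? "matches" : "does not match");

# Hypothetical helper: strips whitespace but keeps vertical tabs.
# [^\S\x0B] means "a whitespace character other than VT", so it gives
# the pre-5.18 result regardless of which perl runs it.
sub strip_ws_legacy {
    my ($line) = @_;
    $line =~ s/[^\S\x0B]//g;
    return $line;
}

my $sample = "a \t b\x0Bc\n";
my $out    = strip_ws_legacy($sample);
print $out eq "ab\x0Bc"
    ? "vertical tab preserved\n"
    : "unexpected result\n";
```

The first line of output should differ between 5.12 and 5.18, while the second should be the same on both.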
U+0009 CHARACTER TABULATION Matched by \s & \h
U+000A LINE FEED Matched by \s & \v
U+000B LINE TABULATION Matched by \s & \v
U+000C FORM FEED Matched by \s & \v
U+000D CARRIAGE RETURN Matched by \s & \v
U+0020 SPACE Matched by \s & \h
U+0085 NEXT LINE Matched by \s & \v
U+00A0 NO-BREAK SPACE Matched by \s & \h
U+1680 OGHAM SPACE MARK Matched by \s & \h
U+2000 EN QUAD Matched by \s & \h
U+2001 EM QUAD Matched by \s & \h
U+2002 EN SPACE Matched by \s & \h
U+2003 EM SPACE Matched by \s & \h
U+2004 THREE-PER-EM SPACE Matched by \s & \h
U+2005 FOUR-PER-EM SPACE Matched by \s & \h
U+2006 SIX-PER-EM SPACE Matched by \s & \h
U+2007 FIGURE SPACE Matched by \s & \h
U+2008 PUNCTUATION SPACE Matched by \s & \h
U+2009 THIN SPACE Matched by \s & \h
U+200A HAIR SPACE Matched by \s & \h
U+2028 LINE SEPARATOR Matched by \s & \v
U+2029 PARAGRAPH SEPARATOR Matched by \s & \v
U+202F NARROW NO-BREAK SPACE Matched by \s & \h
U+205F MEDIUM MATHEMATICAL SPACE Matched by \s & \h
U+3000 IDEOGRAPHIC SPACE Matched by \s & \h
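If the intent is to remove only spaces and tabs within a line, \h may be a closer fit than \s. Here is a minimal sketch, assuming that intent; `strip_horizontal` is a hypothetical helper name and the input string is invented for illustration. \h and \v have been available since Perl 5.10, so this should behave the same on both 5.12 and 5.18.

```perl
use strict;
use warnings;

# Hypothetical helper: removes only horizontal whitespace (\h),
# leaving line feeds, vertical tabs and other \v characters in place.
sub strip_horizontal {
    my ($text) = @_;
    $text =~ s/\h//g;
    return $text;
}

my $in  = "a \tb\x0Bc\n";
my $out = strip_horizontal($in);

# Print the result with non-printing characters made visible.
(my $visible = $out) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
print "$visible\n";
```

Whether dropping the vertical characters from the substitution is acceptable depends on what the library expects of each line, so treat this as an option to evaluate rather than a drop-in replacement.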