这个问题不是特定于Perl的(尽管该unpack函数很可能会影响我的实现).
我必须处理存在多种格式的文件,以便将数据分层次地分解为有意义的部分.我希望能够做的是将文件数据解析为合适的数据结构.
这是一个例子(关于RHS的评论):
# | Format | Level | Comment
# +--------+-------+---------
**DEVICE 109523.69142 # 1 1 file-specific
.981 561A # 2 1
10/MAY/2010 24.15.30,13.45.03 # 3 2 group of records
05:03:01 AB23X 15.67 101325.72 # 4 3 part of single record
* 14 31.30474 13 0 # 5 3 part of single record
05:03:15 CR22X 16.72 101325.42 # 4 3 new record
* 14 29.16264 11 0 # 5 3
06:23:51 AW41X 15.67 101323.9 # 4 3
* 14 31.26493219 0 # 5 3
11/MAY/2010 24.07.13,13.44.63 # 3 2 group of new records
15:57:14 AB23X 15.67 101327.23 # 4 3 part of single record
* 14 31.30474 13 0 # 5 3 part of single record
15:59:59 CR22X 16.72 101331.88 # 4 3 new record
* 14 29.16264 11 0 # 5
Run Code Online (Sandbox Code Playgroud)
我现在的逻辑是脆弱的:
这里最大的问题是我可以采用什么策略来确定哪个格式需要用于哪一行.我很想知道其他人是否遇到过类似情况以及他们为解决这些问题所采取的措施.
在回答你的问题时,我找到了一个简洁的主循环的有趣解决方案:
while (<>) {
given($_) {
when (@{[ map $pattern{$_}, @expect]}) {}
default {
die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
}
}
}
Run Code Online (Sandbox Code Playgroud)
正如您将在下面看到的,%pattern是针对不同格式的命名模式的散列,并且given/when针对Regex对象数组执行短路搜索以找到第一个匹配.
从这里,您可以推断出这@expect是我们希望在当前行上找到的格式名称列表.
有一段时间,我被困在多种可能的预期格式的情况下,以及如何知道匹配的格式,但后来我记得(?{ code })正则表达式:
此零宽度断言评估任何嵌入的Perl代码.它总是成功,其代码不是插值的.
这允许像穷人的yacc语法.例如,匹配和处理格式1的模式是
fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
(?{ $device->{attr1} = $1;
@expect = qw< fmt2 >;
})
/x,
Run Code Online (Sandbox Code Playgroud)
处理完问题后的输入后,$device包含
{
'attr1' => '109523.69142',
'attr2' => '.981',
'attr3' => '561A',
'groups' => [
{
'date' => '10/MAY/2010',
'nnn' => [ '24.15.30', '13.45.03' ],
'records' => [
[ '05:03:01', 'AB23X', '15.67', '101325.72', '14', '31.30474', '13', '0' ],
[ '05:03:15', 'CR22X', '16.72', '101325.42', '14', '29.16264', '11', '0' ],
[ '06:23:51', 'AW41X', '15.67', '101323.9', '14', '31.264932', '19', '0' ],
],
},
{
'date' => '11/MAY/2010',
'nnn' => [ '24.07.13', '13.44.63' ],
'records' => [
[ '15:57:14', 'AB23X', '15.67', '101327.23', '14', '31.30474', '13', '0' ],
[ '15:59:59', 'CR22X', '16.72', '101331.88', '14', '29.16264', '11', '0' ],
],
}
],
}
Run Code Online (Sandbox Code Playgroud)
我很满意结果,但出于某种原因,Larry对perlstyle的建议浮现在脑海中:
仅仅因为你能以某种特定方式做某事并不意味着你应该这样做.
为了完整起见,下面是一个展示结果的工作程序.
#! /usr/bin/perl
use warnings;
use strict;
use feature ':5.10';
use re 'eval';
*ARGV = *DATA;
my $device;
my $record;
my @expect = qw/ fmt1 /;
my %pattern;
%pattern = (
fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
(?{ $device->{attr1} = $1;
@expect = qw< fmt2 >;
})
/x,
fmt2 => qr/^ \s* (\S+) \s+ (\S+) \s*$
(?{ @{$device}{qw< attr2 attr3 >} = ($1,$2);
@expect = qw< fmt3 >;
})
/x,
# e.g., 10/MAY/2010 24.15.30,13.45.03
fmt3 => qr/^ (\d\d\/[A-Z]{3}\/\d{4}) \s+ (\S+) \s*$
(?{ my($date,$nnns) = ($1,$2);
push @{ $device->{groups} } =>
{ nnn => [ split m|,| => $nnns ],
date => $date };
@expect = qw< fmt4 >;
})
/x,
# e.g., 05:03:01 AB23X 15.67 101325.72
fmt4 => qr/^ (\d\d:\d\d:\d\d) \s+
(\S+) \s+ (\S+) \s+ (\S+)
\s*$
(?{ push @{ $device->{groups}[-1]{records} } =>
[ $1, $2, $3, $4 ];
@expect = qw< fmt4 fmt5 >;
})
/x,
# e.g., * 14 31.30474 13 0
fmt5 => qr/^\* \s+ (\d+) \s+
# tricky: possibly no whitespace after 9-char float
((?=\d{1,7}\.\d+)[\d.]{1,9}) \s*
(\d+) \s+ (\d+)
\s*$
(?{ push @{ $device->{groups}[-1]{records}[-1] } =>
$1, $2, $3, $4;
@expect = qw< fmt4 fmt3 fmt2 >;
})
/x,
);
while (<>) {
given($_) {
when (@{[ map $pattern{$_}, @expect]}) {}
default {
die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
}
}
}
use Data::Dumper;
$Data::Dumper::Terse = $Data::Dumper::Indent = 1;
print Dumper $device;
__DATA__
**DEVICE 109523.69142
.981 561A
10/MAY/2010 24.15.30,13.45.03
05:03:01 AB23X 15.67 101325.72
* 14 31.30474 13 0
05:03:15 CR22X 16.72 101325.42
* 14 29.16264 11 0
06:23:51 AW41X 15.67 101323.9
* 14 31.26493219 0
11/MAY/2010 24.07.13,13.44.63
15:57:14 AB23X 15.67 101327.23
* 14 31.30474 13 0
15:59:59 CR22X 16.72 101331.88
* 14 29.16264 11 0
Run Code Online (Sandbox Code Playgroud)