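I need to match `key = value` pairs in arbitrary text, using the following rules:
- the line is indented by `(  |\t)+` (one or more repetitions of two spaces or one tab)
- followed by a `+` character and a space
- followed by `VAR` or `CONST`
- followed by the `key` and the `value`, separated by `=`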
Example properties:
  + VAR somename = somevalue (indented with two spaces)
+ VAR name3 = indented by one \t
The following regex matches these lines:
/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/
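A minimal sketch of how the single-line pattern behaves, assuming a few made-up sample lines (the strings in `@lines` are illustrative, not taken from the real input). The indent is spelled `( {2}|\t)` here only so that the two spaces stay visible; it is meant to be equivalent to the two-space form above.

```perl
use strict;
use warnings;

# Single-line rule: indent, '+', VAR or CONST, key, '=', value.
# ( {2}|\t) is the same "two spaces or one tab" indent sequence.
my $single = qr/^( {2}|\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/;

# Hypothetical sample lines, one per rule being exercised.
my @lines = (
    "  + VAR somename = somevalue",    # two spaces: expected to match
    "\t+ CONST name3 = by one tab",    # one tab: expected to match
    "+ VAR some = not indented",       # no indent: expected to fail
    "  + VAR name5",                   # no '=': expected to fail
);

my @found;
for my $line (@lines) {
    # In list context the match returns the four captures, or an empty list.
    if (my (undef, $type, $key, $val) = $line =~ $single) {
        push @found, "$type $key=$val";
    }
}
print "$_\n" for @found;
```

Only the first two sample lines are collected; the unindented line and the line without `=` are skipped.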
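Now the problem: the grammar allows continuation lines. When a line like the ones above is followed by a line that starts with at least one indent sequence `(  |\t)` (that is, two spaces or one tab) and is not followed by `+`, that line is a continuation line, and its whole content (including the leading whitespace) belongs to the value of the key on the previous line.
Example:
  + VAR multi = 3 line value where the continuation lines
  are indented (starts with two spaces or one tab)
  and NOT followed by the '+'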
The regex for a continuation line is, for example:
/^(  |\t)+([^\+](.*))$/
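A line-based approach makes this easy, e.g. splitting the whole text into lines and processing them one by one.
However, I'm looking for one (complex) regex, mainly for learning and benchmarking, that matches a key = value pair in either its single-line or its multi-line form. I tried this:
while( $text =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)*)/gm ) {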
...
}
But I got:
(?=(  |\t)+[^\+](.*)$)* matches null string many times in regex; marked by <-- HERE in m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)* <-- HERE )/ at so line 36.
Side question: how do I write a multi-line extended regex like this:
/
^(  |\t)+ # <- space ... :(
\+\s+
(VAR|CONST)
\s+
(\w+)
\s*=\s*
(.*)$
/x
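when the regex must contain a literal SPACE character (i.e. the generic `\s` can't be used)?
In case someone wants to help, here is code that produces the wanted output (using the line-based approach), along with the non-working regex-based solution.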
#!/usr/bin/env perl
use 5.014;
use warnings;
use Data::Dumper;

# slurp the whole __DATA__ section into one string
my $txt = do { local $/; <DATA> };

my @matches1 = parse_by_lines($txt // '');
mydump('BY LINES', @matches1);

my @matches2 = parse_by_one_regex($txt // '');
mydump('REGEX', @matches2);

sub parse_by_lines {    # produces the wanted output
    my ($text) = @_;
    my @match;
    my $havekey;
    for my $line (split "\n", $text) {
        if( $line =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/ ) {
            push @match, { indent => $1, type => $2, key => $3, val => $4 };
            $havekey++;
        }
        elsif( $havekey && $line =~ m/^(  |\t)+([^\+](.*))$/ ) {    # continuation line
            $match[-1]->{val} .= "\n$line";    # preserve the \n in the val
        }
        else {
            $havekey = 0;
        }
    }
    return @match;
}

sub parse_by_one_regex {    # not working
    my ($text) = @_;
    my @match;
    while( $text =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)*)/gm ) {
        push @match, { indent => $1, type => $2, key => $3, val => $4 };
    }
    return @match;
}

sub mydump {
    my($label, @match) = @_;
    say "#### $label ####";
    for my $m ( @match ) {
        printf "%-6s: [%s]\n", $_, $m->{$_} for (qw(indent type key val));
        print "\n";
    }
}
__DATA__
some arbitrary text lines
or empty lines
  could be indented
    and could contain any character
  + VAR name1 = var indented by two spaces and the first nonspace character is '+'
line of arbitrary text
    + VAR name2 = var indented by 2x2 spaces
	+ VAR name3 = var indented by one \t
  + VAR name4 = the next line with "name5" is not valid. missing the = character, should not be matched
  + VAR name5
  + CONST name6 = the type could be VAR or CONST
  + VAR multi1 = multiline value where the continuation lines
  are indented (starts with two spaces or one tab) and NOT followed by the '+'
  + VAR multi1 = multiline value
    indented
  + VAR multi1 = multiline value
	indented ok too
  + VAR single = this is single line
  + because this line even if it is indented, the first nonspace character is '+'
  + VAR multi2 = multiline
    could be
      indented
  any way
    and any number of times
until the first non-indented line
the following should NOT match
+ VAR some = should not be matched, because the line isn't indented
 + VAR some = should not be matched, because the line isn't indented at least with TWO spaces or one tab
  + SOME name = value not matched because the SOME isn't VAR or CONST
EDIT: Using the accepted answer and adding the wanted capture groups, I ended up with the following:
while( $text =~ /
    (?m)                    # multiline match
    ^                       # at the start of the line
    ([ ]{2}|\t)+            # two spaces or tab - at least once
    \+                      # the '+' character
    \s*                     # followed by any number of spaces (e.g. "+VAR" or "+ VAR" are valid)
    (VAR|CONST)             # the VAR or CONST
    \s+                     # followed by at least one space (e.g. "VAR_" should not be matched)
    (\w+)                   # the keyword
    \s*=\s*                 # the '=' surrounded (and consumed) by any number of spaces
    (                       # capture the whole value (as it is)
        .*                  # any string up to the end of the line
        (?:                 # followed by (non-capturing group)
            \R              # one line-break
            ^               # at the start of the line
            (?>[ ]{2,}|\t+) # atomic group - at least two spaces or at least one tab
            [^+]            # followed by any character but '+'
            .*              # any string up to the end of the line
        )*                  # any number of times (i.e. optionally)
    )
/xg) {
    push @match, { indent => $1, type => $2, key => $3, val => $4 };
}
EDIT2: Yes, the regex-based solution is 34% faster (at least on my hardware).
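For anyone who wants to repeat such a comparison, here is a rough sketch using the core `Benchmark` module. The subs `by_lines` and `by_regex` are simplified, hypothetical stand-ins for the two parsers above, and `$text` is made-up sample input, so the printed rates will not reproduce the 34% figure; they depend on hardware, Perl version and data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up input: two keys (one with a continuation line) repeated 200 times.
my $text = (join("\n",
    "noise",
    "  + VAR a = one",
    "  + VAR b = two",
    "    continued",
    "noise",
) . "\n") x 200;

# Simplified line-based parser (stand-in, not the exact code above).
sub by_lines {
    my ($t) = @_;
    my (@m, $have);
    for my $line (split /\n/, $t) {
        if ($line =~ /^(?: {2}|\t)+\+\s*(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/) {
            push @m, { type => $1, key => $2, val => $3 };
            $have = 1;
        }
        elsif ($have && $line =~ /^(?> {2,}|\t+)[^+]/) {
            $m[-1]{val} .= "\n$line";    # append the continuation line
        }
        else {
            $have = 0;
        }
    }
    return @m;
}

# Simplified single-regex parser (stand-in, not the exact code above).
sub by_regex {
    my ($t) = @_;
    my @m;
    while ($t =~ /^(?: {2}|\t)+\+\s*(VAR|CONST)\s+(\w+)\s*=\s*(.*(?:\R^(?> {2,}|\t+)[^+].*)*)/mg) {
        push @m, { type => $1, key => $2, val => $3 };
    }
    return @m;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    by_lines => sub { my @r = by_lines($text) },
    by_regex => sub { my @r = by_regex($text) },
});
```

It is worth checking that both subs return the same pairs on your own data before trusting the timings.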
Regex:
(?m)^(?:[ ]{2,}|\t+)\+ *(?:VAR|CONST) *\w+ *=.*(?:\R^(?>[ ]{2,}|\t+)[^+\s].*)*
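A minimal usage sketch, assuming a small made-up input in `$text`. The "at least two spaces" indent is written as ` {2,}` so the count is explicit, and `\R` needs Perl 5.10 or later. Because the pattern has no capturing groups, a `/g` match in list context returns the whole matched blocks.

```perl
use strict;
use warnings;

# Hypothetical input: two single-line pairs, one multi-line pair,
# and one line that is indented by only one space.
my $text = join "\n",
    "intro line",
    "  + VAR one = first",
    "  + VAR multi = starts here",
    "    continues here",
    "\tand here",
    "  + CONST two = second",
    " + VAR bad = only one space",
    "";

my $re = qr/(?m)^(?: {2,}|\t+)\+ *(?:VAR|CONST) *\w+ *=.*(?:\R^(?> {2,}|\t+)[^+\s].*)*/;

# Each element is one complete key = value block, continuation lines included.
my @pairs = $text =~ /$re/g;

for my $i (0 .. $#pairs) {
    print "--- block $i ---\n$pairs[$i]\n";
}
```

The second block spans three lines; the line indented by a single space is not matched at all.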
The important part is the last cluster:
(?:                   # Start of non-capturing group (a)
    \R                # One line-break
    ^                 # Start of line
    (?>[ ]{2,}|\t+)   # At least two spaces or one tab character (possessively)
    [^+\s]            # Next character is neither `+` nor whitespace
    .*                # Up to end of line
)*                    # Repeat it as much as possible - end of non-capturing group (a)
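To see what the atomic group buys, here is a small sketch comparing a plain group with an atomic one. It deliberately uses the looser `[^+]` class, because with `[^+\s]` any space handed back by backtracking is rejected anyway and the atomic group mostly saves wasted retries. The variable names and sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Same continuation test twice: plain group vs. atomic group for the indent.
my $plain  = qr/^(?: {2,}|\t+)[^+].*$/;    # indent may give spaces back
my $atomic = qr/^(?> {2,}|\t+)[^+].*$/;    # indent is kept whole

my $new_key      = "    + VAR next = x";       # four spaces, then '+'
my $continuation = "    more of the value";    # four spaces, then text

my %result = (
    plain_new_key  => ($new_key      =~ $plain  ? 1 : 0),
    atomic_new_key => ($new_key      =~ $atomic ? 1 : 0),
    plain_cont     => ($continuation =~ $plain  ? 1 : 0),
    atomic_cont    => ($continuation =~ $atomic ? 1 : 0),
);

print "$_=$result{$_}\n" for sort keys %result;
```

With the plain group the engine gives one space back, lets `[^+]` match that space, and wrongly accepts the `+` line as a continuation. The atomic group refuses to give anything back, so that line is rejected while the real continuation still matches.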
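To answer your second question:
When the `x` modifier is set, literal space characters are simply ignored as meaningful parts of the regex, unless you enclose them in a character class `[ ]` and use a quantifier, as in `[ ]{2,}`, to say how many times they should occur.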
/
(?m)
^
(?:
    [ ]{2,}
  |
    \t+
)\+
[ ]*
(?:
    VAR
  |
    CONST
)
[ ]*\w+[ ]*=.*
(?:
    \R
    ^
    (?>
        [ ]{2,}
      |
        \t+
    )
    [^+\s].*
)*
/x
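The effect of `/x` on literal spaces can be checked with a short sketch like the following. The sample line is made up, and the backslash-escaped form is an additional spelling that Perl also accepts under `/x`.

```perl
use strict;
use warnings;

my $line = "  + VAR a = b";    # two leading spaces

# Under /x the two bare spaces in the pattern are dropped, leaving /^\+/.
my $bare    = ($line =~ /^  \+/x)     ? 1 : 0;

# A bracketed class keeps the space, so {2} can count it.
my $class   = ($line =~ /^[ ]{2}\+/x) ? 1 : 0;

# A backslash-escaped space is also kept under /x.
my $escaped = ($line =~ /^\ \ \+/x)   ? 1 : 0;

print "bare=$bare class=$class escaped=$escaped\n";
```

Only the bare-space variant fails, because after `/x` strips the spaces it requires `+` to be the very first character of the line.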