What is the most readable regex for extracting the second word, without trailing spaces, from a comma-separated string?

DVK*_*DVK 4 regex perl readability

I have an array of strings of the form:

@source = (
     "something,something2,third"
    ,"something,something3   ,third"
    ,"something,something4"
    ,"something,something 5" # Note the space in the middle of the word
);

I need a regex that extracts the second comma-separated word, without its trailing spaces, and collects those second words in an array.

@expected_result = ("something2","something3","something4","something 5");

What is the most readable way to achieve this?

I have 3 possibilities, none of which seems optimally readable:

  1. Pure regex, then capture $1

    @result = map { (/[^,]+,([^,]*[^, ]) *(,|$)/ )[0] } @source;
    
  2. Split on the comma (this is not CSV, so no real parsing is needed), then trim:

    @result = map { my @s = split(","); $s[1] =~ s/ *$//; $s[1] } @source;
    
  3. Put the split and the trim into nested maps

    @result = map { s/ *$//; $_ } map { (split(","))[1] } @source;
    

Which one is better? Are there any other, more readable alternatives that I haven't thought of?

rua*_*akh 6

Of these possibilities, I think #2 is the clearest, though I would tweak it slightly so that the split pattern also consumes the spaces:

@result = map { my @s = split(/ *(?:,|$)/); $s[1] } @source;

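(For that matter, I would probably actually write it as /[ ]*(?:,|$)/, with the no-op character class, just because it makes it more obvious what the * is quantifying.)

Edited to add: Whoops, I previously had a silly mistake, whereby this would not remove trailing spaces from something like "foo, bar ". Now that I have fixed that mistake, the result is not quite so simple, and I am no longer sure I would recommend the above!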
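
As a rough check of the split pattern above, here is a minimal sketch that runs it over the question's sample data plus the "foo, bar " case. The helper name second_field is hypothetical and exists only for this illustration; the pattern itself is the one given above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): return the second
# comma-separated field of $line with trailing spaces removed.
sub second_field {
    my ($line) = @_;

    # The delimiter is "optional spaces, then a comma or end-of-string",
    # so spaces just before a delimiter are swallowed by the split.
    my @fields = split /[ ]*(?:,|$)/, $line;
    return $fields[1];
}

my @source = (
    "something,something2,third",
    "something,something3   ,third",
    "something,something4",
    "something,something 5",   # space in the middle of the word is kept
    "foo, bar ",               # trailing space at end-of-string
);

print "[", second_field($_), "]\n" for @source;
```

Note that only trailing spaces are removed: for "foo, bar " the second field still keeps its leading space, which matches what the question asked for.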


Gre*_*con 6

Use named capture groups, and give names to the subpatterns with (DEFINE), to greatly improve readability.

#! /usr/bin/env perl

use strict;
use warnings;

use 5.10.0;  # for named capture buffer and (?&...)

my $second_trimmed_field_pattern = qr/
  (?&FIRST_FIELD) (?&SEP) (?<f2> (?&SECOND_FIELD))

  (?(DEFINE)
    # The separator is a comma preceded by optional whitespace.
    # NOTE: the format is simple comma separators, NOT full CSV, so
    # we don't have to worry about processing escapes or quoted
    # fields.
    (?<SEP>  \s* ,)

    # A field stops matching as soon as it sees a separator
    # or end-of-string, so it matches in similar fashion to
    # a pattern with a non-greedy quantifier.
    (?<FIELD> (?: (?! (?&SEP) | $) .)+ )

    # The first field is anchored at start-of-string.
    (?<FIRST_FIELD>  ^  (?&FIELD))

    # The second field looks like any other field. The name
    # captures our intent for its use in the main pattern.
    (?<SECOND_FIELD> (?&FIELD))
  )
/x;

In action:

my @source = (
     "something,something2,third"
    ,"something,something3   ,third"
    ,"something,something4"
    ,"something,something 5" # Note the space in the middle of the word
);

for (@source) {
  if (/$second_trimmed_field_pattern/) {
    print "[$+{f2}]\n";

    #print "[$1]\n";  # or do it the old-fashioned way
  }
  else {
    chomp;
    print "no match for [$_]\n";
  }
}

Output:

[something2]
[something3]
[something4]
[something 5]

You can express the same idea in older perls. Below, I confine the pieces to the lexical scope of a sub to show that they work together as a unit.

sub make_second_trimmed_field_pattern {
  my $sep = qr/
    # The separator is a comma preceded by optional whitespace.
    # NOTE: the format is simple comma separators, NOT full CSV, so
    # we don't have to worry about processing escapes or quoted
    # fields.

    \s* ,
  /x;

  my $field = qr/
    # A field stops matching as soon as it sees a separator
    # or end-of-string, so it matches in similar fashion to
    # a pattern with a non-greedy quantifier.
    (?:
        # the next character to be matched is not the
        # beginning of a separator sequence or
        # end-of-string
        (?! $sep | $ )

        # ... so consume it
        .
    )+  # ... as many times as possible
  /x;

  qr/ ^ $field $sep ($field) /x;
}

Use it as

my @source = ...;  # same as above

my $second_trimmed_field_pattern = make_second_trimmed_field_pattern;
for (@source) {
  if (/$second_trimmed_field_pattern/) {
    print "[$1]\n";
  }
  else {
    chomp;
    print "no match for [$_]\n";
  }
}

Output:

$ perl5.8.8 prog
[something2]
[something3]
[something4]
[something 5]
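
The question asks for the second words collected in an array rather than printed. A minimal sketch of one way to do that is to drop the composed pattern into a map; the pieces are rebuilt inline here only to keep the example self-contained, and the @result name is taken from the question.

```perl
use strict;
use warnings;

# Same building blocks as in the sub above, written compactly.
my $sep   = qr/ \s* , /x;                      # comma preceded by optional whitespace
my $field = qr/ (?: (?! $sep | $ ) . )+ /x;    # stops at a separator or end-of-string
my $second_trimmed_field_pattern = qr/ ^ $field $sep ($field) /x;

my @source = (
    "something,something2,third",
    "something,something3   ,third",
    "something,something4",
    "something,something 5",
);

# A failed match contributes the empty list, so lines without a
# second field are simply skipped instead of yielding undef.
my @result = map { /$second_trimmed_field_pattern/ ? $1 : () } @source;

print join("|", @result), "\n";
```

Whether skipping non-matching lines is the right behaviour depends on the data; returning undef instead would keep @result aligned with @source.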