How do I make "grep -zoP" display each match separately?

Question

fed*_*qui 5 regex awk grep text-processing

I have a file in this form:

X/this is the first match/blabla
X-this is
the second match-

and here we have some fluff.

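I want to extract everything that appears after an "X" and between two occurrences of the same marker. So if I have "X+match+", I want to get "match", because it appears after "X" and between the marker "+".

So for the given sample file I would like to get this output: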

this is the first match

and then

this is
the second match

I managed to get everything between X and the marker by using:

grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file

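That is:

grep -Po '(?<=X(.))(.|\n)+(?=\1)' matches X followed by (something) that gets captured, and that must be matched again at the end with (?=\1) (I based the code on my answer here).
Note that I use (.|\n) to match anything, including a newline, and that I also use -z in grep so that newlines can be matched.

So this works well; the only problem is how the output is displayed: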

$ grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
this is the first matchthis is
the second match

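As you can see, all the matches appear together, with "this is the first match" followed by "this is the second match" and no visible separator at all. I know this comes from the use of "-z", which treats the whole file as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline (quoting "man grep").

So: is there a way to get all these results separately?

I also tried in GNU Awk: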
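With -z, GNU grep terminates each output record with a NUL byte as well, and most terminals simply do not draw that byte. A minimal sketch for making it visible, assuming GNU grep built with PCRE support and a cat that accepts -v; the temporary file and the variable name visible are illustrative only:

```shell
# Recreate the sample file from the question in a temporary location.
sample=$(mktemp)
printf '%s\n' 'X/this is the first match/blabla' 'X-this is' \
    'the second match-' '' 'and here we have some fluff.' > "$sample"

# cat -v renders non-printing bytes visibly; a NUL byte shows up as ^@.
visible=$(grep -zPo '(?<=X(.))(.|\n)+(?=\1)' "$sample" | cat -v)
printf '%s\n' "$visible"

rm -f "$sample"
```

Each ^@ in the output marks the end of one match, so the matches are delimited even though the terminal shows them run together.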

awk 'match($0, /X(.)(\n|.*)\1/, a) {print a[1]}' file

But not even the (\n|.*) part works.

Answer 1

Sun*_*eep 5

awk does not support backreferences within a regexp definition.

Workarounds:
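If awk is still preferred, the extraction can be approximated with string functions instead of a backreference. This is only a rough sketch, not an equivalent of the grep pattern: it stops at the first repetition of the marker (non-greedy behaviour), it also accepts an empty match, and the function name extract_with_awk, the variable result, and the temporary file are all made up for illustration.

```shell
# Recreate the sample file from the question in a temporary location.
sample=$(mktemp)
printf '%s\n' 'X/this is the first match/blabla' 'X-this is' \
    'the second match-' '' 'and here we have some fluff.' > "$sample"

extract_with_awk() {
    awk '
        { text = text $0 "\n" }                  # slurp the whole file into one string
        END {
            while ((i = index(text, "X")) > 0) {
                delim = substr(text, i + 1, 1)   # marker: the character right after X
                rest  = substr(text, i + 2)      # everything after that marker
                j = index(rest, delim)           # first repetition of the marker
                if (j == 0) {                    # marker never repeats: skip this X
                    text = substr(text, i + 1)
                    continue
                }
                print substr(rest, 1, j - 1)     # text between the two markers
                text = substr(rest, j + 1)       # resume after the closing marker
            }
        }
    ' "$1"
}

result=$(extract_with_awk "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Because it relies only on index and substr, this sketch should not depend on GNU Awk extensions such as the three-argument match.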

$ grep -zPo '(?s)(?<=X(.)).+(?=\1)' ip.txt | tr '\0' '\n'
this is the first match
this is
the second match

# with ripgrep, which supports multiline matching
$ rg -NoUP '(?s)(?<=X(.)).+(?=\1)' ip.txt
this is the first match
this is
the second match

You can also use (?s)X(.)\K.+(?=\1) instead of (?s)(?<=X(.)).+(?=\1). Also, you might want to use a non-greedy quantifier here to avoid matching match+xyz+foobaz for the input X+match+xyz+foobaz+

With perl:
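The difference between the greedy and non-greedy quantifier can be checked on that very input. A small sketch, assuming GNU grep with -P support; it uses the lookbehind form of the pattern on a single line, so -z is not needed, and the variable names greedy and lazy are illustrative:

```shell
input='X+match+xyz+foobaz+'

# Greedy: .+ runs as far as it can, so the match ends before the LAST '+'.
greedy=$(printf '%s\n' "$input" | grep -oP '(?<=X(.)).+(?=\1)')

# Non-greedy: .+? stops before the FIRST repetition of the marker.
lazy=$(printf '%s\n' "$input" | grep -oP '(?<=X(.)).+?(?=\1)')

printf 'greedy: %s\nlazy:   %s\n' "$greedy" "$lazy"
```

Here the greedy form yields match+xyz+foobaz, while the non-greedy form yields match.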

$ perl -0777 -nE 'say $& while(/X(.)\K.+(?=\1)/sg)' ip.txt
this is the first match
this is
the second match

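Very nice, many thanks. The key is to replace the "\0" wherever it is found; I had not noticed that this character was present in the output. (2 upvotes)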

Answer 2

tri*_*eee 2

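The use case is somewhat problematic, because as soon as you print the matches you lose the information about exactly where the separators were. But if that is acceptable, try piping to xargs -r0.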

grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file | xargs -r0

These options are GNU extensions, but then so are grep -z and (mostly) grep -P, so perhaps that is acceptable.
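With no command given, xargs runs echo on all items at once, so the matches end up separated by spaces. If the boundaries should stay visible, one option is to hand the items to printf, which reuses its format for every argument. A sketch under the same GNU assumptions (grep -zP and xargs -r0); the bracket format and the variable name separated are arbitrary choices for illustration:

```shell
# Recreate the sample file from the question in a temporary location.
sample=$(mktemp)
printf '%s\n' 'X/this is the first match/blabla' 'X-this is' \
    'the second match-' '' 'and here we have some fluff.' > "$sample"

# -0: items are NUL-delimited; -r: run nothing if there is no input.
# printf repeats '[%s]\n' for each item, so every match is bracketed.
separated=$(grep -zPo '(?<=X(.))(.|\n)+(?=\1)' "$sample" | xargs -r0 printf '[%s]\n')
printf '%s\n' "$separated"

rm -f "$sample"
```

The brackets make it possible to tell a newline inside a match apart from the boundary between two matches.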
