AWK print all regex matches on every line

Question

I have the following text input:

lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
incididunt ut

As the text shows, the number of <?> tags per line is not fixed: a line can contain zero, one, or several of them.

Using only awk, I need to output this:

<a> <b> <c>
<d> <e>
<f>

I tried this awk script:

awk '{
  match($0,/<[^>]+>/,a);           # fill array a with matches
  for (i in a) {
    if (match(i, /^[0-9]+$/) != 0) # ignore non numeric indices
      print a[i]
  }
}' somefile.txt

but this only outputs the first match on every line:

<a>
<d>
<f>

Is there some way of doing this with match() or any other built-in function?

Answer 1

Rav*_*h13 16

With GNU awk you can use its out-of-the-box (OOTB) variable named FPAT; try the following awk code.

awk -v FPAT='<[^>]*>' '
NF{
  val=""
  for(i=1;i<=NF;i++){
    val=(val?val OFS:"") $i
  }
  print val
}
'  Input_file

Answer 2

mar*_*rkp 11

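match() does not work the way you think: it only reports the first match in the string (in GNU awk, the optional array argument receives that match and its parenthesized subexpressions, not subsequent matches). To find a variable number of matches, you need to match() the first occurrence, strip it off, match() the remaining input for the next occurrence, and repeat until the current line has no more matches; for example: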

awk '
{ out=sep=""                                     # init variables for new line
  while (match($0,/<[^>]+>/)) {                  # find 1st match
        out=out sep substr($0,RSTART,RLENGTH)    # build up output line
        $0=substr($0,RSTART+RLENGTH)             # strip off 1st match and prep for next while() check
        sep=OFS                                  # set field separator for follow-on matches
  }
  if (out) print out
}' somefile.txt

Another idea uses the split() function, for example:

awk '
{ n=split($0,a,/[<>]/)                           # split line on dual delimiters "<" and ">"
  out=sep=""
  for (i=2;i<=n;i=i+2) {                         # step through even numbered array entries; assumes line does not contain any standalone "<" or ">" characters !!!
      out=out sep "<" a[i] ">"                   # build output line
      sep=OFS 
  }
  if (out) print out
}
' somefile.txt

Both of these generate:

<a> <b> <c>
<d> <e>
<f>

Answer 3

gle*_*man 10

Assuming there are no stray angle brackets, use < or > as the field separator and print every second field:

awk -F'[<>]' '{for (i=2; i <= NF; i += 2) {printf "<%s> ", $i}; print ""}' data
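
Note that the one-liner above leaves a trailing space after the last tag and prints an empty line for every input line without tags. A minimal sketch of a variant that avoids both, under the same no-stray-brackets assumption, might look like this; the function name extract_tags and the sample input are illustrative only and not part of the original answer:

```shell
# Hypothetical helper: print the <...> tags found on each line, joined by OFS.
# Assumes, like the one-liner above, that the input has no stray "<" or ">".
extract_tags() {
  awk -F'[<>]' '{
    out = ""
    for (i = 2; i <= NF; i += 2)              # even-numbered fields sit between "<" and ">"
      out = out (i > 2 ? OFS : "") "<" $i ">"
    if (out != "") print out                  # skip lines without any tag
  }' "$@"
}

# Sample lines taken from the question
printf '%s\n' 'lorem <a> ipsum <b> dolor <c> sit amet,' 'incididunt ut' | extract_tags
```

For the two sample lines this prints <a> <b> <c> and nothing for the tag-less line. Only POSIX awk features are used, so it should not require GNU awk.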

Answer 4

Fra*_*ona 9

Here is a simple regex-based awk solution:

awk '{ gsub(/^[^<]*|[^>]*$/,""); gsub(/>[^<]*</,"> <") } NF'

(Edit: use NF instead of $0 != ""; thanks @EdMorton)

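For each line:

Remove all characters from the left up to (but not including) the first <, or up to the end of the line when no < is found.
Remove all characters from the right back to (but not including) the last >, or back to the start of the line when no > is found.
Replace whatever lies between each > and the following < with a single space character.
Print the result when it is not empty.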
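
The steps above can be traced one at a time. The sketch below is for illustration only: it splits the combined gsub() of the one-liner into separate sub() calls and prints the line after each step. The function name trace_steps and the numbered labels are assumptions made for this example.

```shell
# Hypothetical helper: show the line after each step of the approach above.
trace_steps() {
  awk '{
    sub(/^[^<]*/, "");       print "1: " $0   # drop everything before the first "<"
    sub(/[^>]*$/, "");       print "2: " $0   # drop everything after the last ">"
    gsub(/>[^<]*</, "> <");  print "3: " $0   # collapse text between tags to one space
  }' "$@"
}

printf '%s\n' 'lorem <a a> ipsum <b> dolor <c> sit amet,' | trace_steps
```

A line such as h>ell<o ends up empty after step 2, which is why the final NF test in the one-liner suppresses it.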

Example

lorem <a a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
<g>incididunt ut<h><i>
h>ell<o
<j>

Output

<a a> <b> <c>
<d> <e>
<f>
<g> <h> <i>
<j>

Remark: the exact same logic can also be applied with sed.
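
As a hedged sketch only, and not necessarily the command the answer's author had in mind, the same steps could be written with POSIX sed roughly as follows. The function name tags_sed is hypothetical.

```shell
# Hypothetical sed counterpart of the awk one-liner above (a sketch under the same logic).
tags_sed() {
  # 1) strip text before the first "<"   2) strip text after the last ">"
  # 3) collapse text between tags        4) delete lines that are left empty
  sed -e 's/^[^<]*//' -e 's/[^>]*$//' -e 's/>[^<]*</> </g' -e '/^$/d' "$@"
}

printf '%s\n' '<g>incididunt ut<h><i>' 'h>ell<o' '<j>' | tags_sed
```

With these three sample lines the sketch prints <g> <h> <i> and <j>; the h>ell<o line is reduced to an empty string and deleted.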

Answer 5

anu*_*ava 8

Here is a simple alternative gnu-awk solution using patsplit:

awk '
n = patsplit($0, m, /<[^>]+>/) {
   for (i=1; i<=n; ++i)
      printf "%s", m[i] (i < n ? OFS : ORS)
}' file

<a> <b> <c>
<d> <e>
<f>

Answer 6

Daw*_*weo 8

I would harness GNU AWK for this task in the following way. Let the content of file.txt be

lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
incididunt ut

then

awk 'BEGIN{FPAT="<[^>]*>"}{$1=$1;print}' file.txt

gives the output

<a> <b> <c>
<d> <e>
<f>

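Explanation: I inform GNU AWK that a field is < followed by zero or more (*) characters that are not (^) >, followed by >. For each line I do $1=$1 to trigger a rebuild of the record, so the line now consists of the found fields joined by spaces, and then I print it.

(tested in gawk 4.2.1)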

Answer 7

RAR*_*sto 6

Input

lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
incididunt ut

Code

mawk -F'^[^<]+|[^>]+$' 'gsub(">[^<]*<","> <",$!(NF=NF))^_*/./' OFS=

Output

<a> <b> <c>
<d> <e>
<f>
