I have the following text input:
lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
incididunt ut
As the text shows, the number of <?> occurrences is not fixed: a line can contain zero or several of them.
Using only awk, I need to output this:
<a> <b> <c>
<d> <e>
<f>
I tried this awk script:
awk '{
    match($0, /<[^>]+>/, a)               # fill array a with matches
    for (i in a) {
        if (match(i, /^[0-9]+$/) != 0)    # ignore non numeric indices
            print a[i]
    }
}' somefile.txt
but this only outputs the first match on every line:
<a>
<d>
<f>
Is there some way of doing this with match() or any other built-in function?
Rav*_*h13 16
With GNU awk you can use the built-in variable FPAT; try the following awk code.
awk -v FPAT='<[^>]*>' '
NF{
val=""
for(i=1;i<=NF;i++){
val=(val?val OFS:"") $i
}
print val
}
' Input_file
mar*_*rkp 11
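match() doesn't work the way you think it does. To find a variable number of matches you need to match() the first occurrence of the pattern, strip it off, match() the remainder of the input for the next occurrence, and repeat until there are no more matches in the current line; for example: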
awk '
{ out=sep="" # init variables for new line
while (match($0,/<[^>]+>/)) { # find 1st match
out=out sep substr($0,RSTART,RLENGTH) # build up output line
$0=substr($0,RSTART+RLENGTH) # strip off 1st match and prep for next while() check
sep=OFS # set field separator for follow-on matches
}
if (out) print out
}' somefile.txt
Another idea uses the split() function, for example:
awk '
{ n=split($0,a,/[<>]/) # split line on dual delimiters "<" and ">"
out=sep=""
for (i=2;i<=n;i=i+2) { # step through even numbered array entries; assumes line does not contain any standalone "<" or ">" characters !!!
out=out sep "<" a[i] ">" # build output line
sep=OFS
}
if (out) print out
}
' somefile.txt
Both of these generate:
<a> <b> <c>
<d> <e>
<f>
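The match() loop above consumes $0 as it goes. If the original line is still needed afterwards, the same loop can be moved into a user-defined awk function that works on a copy of the line. This is only a sketch: the names extract_tags and all_matches are hypothetical and do not come from the answer.

```shell
# extract_tags: hypothetical shell wrapper; reads the named files or stdin.
extract_tags() {
  awk '
  # all_matches: hypothetical helper. Collects every <...> match found in
  # str, joined by OFS. str is a local copy, so $0 is left untouched.
  function all_matches(str,    out, sep) {
    out = sep = ""
    while (match(str, /<[^>]+>/)) {
      out = out sep substr(str, RSTART, RLENGTH)   # append current match
      str = substr(str, RSTART + RLENGTH)          # continue after it
      sep = OFS
    }
    return out
  }
  { tags = all_matches($0) }
  tags != "" { print tags }    # skip lines without any match
  ' "$@"
}
```

Usage: extract_tags somefile.txt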
gle*_*man 10
Assuming there are no stray angle brackets, use either < or > as the field separator and print every second field:
awk -F'[<>]' '{for (i=2; i <= NF; i += 2) {printf "<%s> ", $i}; print ""}' data
Here is a simple regex-based awk solution:
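As written, the one-liner above leaves a trailing space after the last tag and prints an empty line for input lines that contain no tags. A possible variant that avoids both, still under the same assumption of no stray angle brackets, is sketched below; the function name tags_by_split is made up for illustration.

```shell
# tags_by_split: hypothetical wrapper; reads the named files or stdin.
tags_by_split() {
  awk -F'[<>]' '
  NF >= 3 {                         # at least one complete <...> pair
    line = ""
    for (i = 2; i <= NF; i += 2)    # even-numbered fields hold the tag text
      line = line (line == "" ? "" : OFS) "<" $i ">"
    print line
  }' "$@"
}
```

Usage: tags_by_split data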
awk '{ gsub(/^[^<]*|[^>]*$/,""); gsub(/>[^<]*</,"> <") } NF'
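Edit: use NF instead of $0 != ""; thanks @EdMorton
For each line:
- remove all characters from the start of the line up to the first <
- remove all characters after the last > up to the end of the line
- replace whatever lies between each > and the following < with a single space character
Input:
lorem <a a> ipsum <b> dolor <c> sit amet,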
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
<g>incididunt ut<h><i>
h>ell<o
<j>
Output:
<a a> <b> <c>
<d> <e>
<f>
<g> <h> <i>
<j>
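To see what each substitution contributes, the steps can be traced one at a time. The sketch below splits the first gsub() into its two alternatives, which should behave the same as the combined pattern on this input; trace_steps is a hypothetical name used only for this illustration.

```shell
# trace_steps: hypothetical helper that prints the line after each step.
trace_steps() {
  awk '{
    print "input : [" $0 "]"
    gsub(/^[^<]*/, "")        # step 1: drop everything before the first "<"
    print "step 1: [" $0 "]"
    gsub(/[^>]*$/, "")        # step 2: drop everything after the last ">"
    print "step 2: [" $0 "]"
    gsub(/>[^<]*</, "> <")    # step 3: reduce text between tags to one space
    print "step 3: [" $0 "]"
  }' "$@"
}
```

Usage: printf 'lorem <a a> ipsum <b> dolor <c> sit amet,\n' | trace_steps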
Remark: exactly the same logic can also be applied with sed.
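The sed command itself is not reproduced here, so the following is only a sketch of how the same three steps might look in POSIX sed. It is not the answer author's command, and the function name tags_with_sed is hypothetical.

```shell
# tags_with_sed: hypothetical wrapper; reads the named files or stdin.
# Expression 1 drops everything before the first "<".
# Expression 2 drops everything after the last ">".
# Expression 3 reduces the text between tags to a single space.
# Expression 4 deletes lines that ended up empty.
tags_with_sed() {
  sed -e 's/^[^<]*//' \
      -e 's/[^>]*$//' \
      -e 's/>[^<]*</> </g' \
      -e '/^$/d' "$@"
}
```

Usage: tags_with_sed file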
Here is a simple alternative gnu-awk solution using patsplit:
awk '
n = patsplit($0, m, /<[^>]+>/) {
for (i=1; i<=n; ++i)
printf "%s", m[i] (i < n ? OFS : ORS)
}' file
<a> <b> <c>
<d> <e>
<f>
I would use GNU AWK for this task as follows. Let the content of file.txt be
lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
incididunt ut
then
awk 'BEGIN{FPAT="<[^>]*>"}{$1=$1;print}' file.txt
gives the output
<a> <b> <c>
<d> <e>
<f>
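Explanation: I tell GNU AWK that a field is < followed by zero or more (*) characters that are not (^) >, followed by >. For each line I do $1=$1 to force the record to be rebuilt, so the line now consists of the fields that were found, joined by spaces, and then I print it. Note that a line without any match, such as the last input line, is printed as an empty line.
(tested in gawk 4.2.1)
Input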
lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
incididunt ut
Code
mawk -F'^[^<]+|[^>]+$' 'gsub(">[^<]*<","> <",$!(NF=NF))^_*/./' OFS=
Output
<a> <b> <c>
<d> <e>
<f>