ddr*_*han 0 awk text-processing
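I need to delete lines conditionally: a line should be removed when column 2 is exactly "eating" and the combination of the values in columns 3 and 4 has already appeared on an earlier line.
My sample CSV data looks like this: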
a,eating,apple,2
b,throwing,banana,1
c,eating,apple,3
d,eating,apple,1
e,eating,banana,2
f,throwing,apple,2
g,throwing,banana,2
h,throwing,banana,3
i,eating,apple,2
j,eating,apple,3
k,eating,banana,1
l,throwing,banana,2
m,throwing,banana,1
n,throwing,apple,1
o,eating,apple,3
p,eating,banana,2
q,throwing,apple,1
r,throwing,apple,2
s,eating,apple,1
The output should look like this:
a,eating,apple,2
b,throwing,banana,1
c,eating,apple,3
d,eating,apple,1
e,eating,banana,2
f,throwing,apple,2
g,throwing,banana,2
h,throwing,banana,3
k,eating,banana,1
l,throwing,banana,2
m,throwing,banana,1
n,throwing,apple,1
q,throwing,apple,1
r,throwing,apple,2
Assuming the input is "simple CSV", i.e. no field contains embedded commas or newlines, we can use awk like this:
$ awk -F, '$2 != "eating" || !seen[$3,$4]++' file
a,eating,apple,2
b,throwing,banana,1
c,eating,apple,3
d,eating,apple,1
e,eating,banana,2
f,throwing,apple,2
g,throwing,banana,2
h,throwing,banana,3
k,eating,banana,1
l,throwing,banana,2
m,throwing,banana,1
n,throwing,apple,1
q,throwing,apple,1
r,throwing,apple,2
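This prints the current line if the second comma-separated field is not exactly the string "eating", or (if the second field is "eating") if the combination of the third and fourth fields has not been seen before. Because || short-circuits, the seen array is only updated for lines whose second field is "eating", so a combination that occurred only on a non-"eating" line does not count as seen.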
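The same logic can be spelled out more verbosely, which may make the short-circuit behavior easier to follow. This is only a sketch: the function name filter_eating is hypothetical, and the four input lines are an excerpt of the question's data chosen for illustration.

```shell
# filter_eating is a hypothetical wrapper name, used here for illustration only.
filter_eating() {
  awk -F, '
    # Lines whose second field is not "eating" are always printed
    # and are never recorded in the seen array.
    $2 != "eating" { print; next }

    # For "eating" lines, print only the first occurrence of each
    # (field 3, field 4) combination.
    {
      key = $3 SUBSEP $4    # the same key that seen[$3,$4] builds
      if (!(key in seen)) print
      seen[key] = 1
    }
  ' "$@"
}

# Excerpt of the sample data: line i repeats the apple,2 combination of line a.
printf '%s\n' \
  'a,eating,apple,2' \
  'b,throwing,banana,1' \
  'i,eating,apple,2' \
  'k,eating,banana,1' |
filter_eating
```

On this excerpt, lines a, b and k are printed. Line i is dropped because apple,2 was already seen on line a, while line k survives because banana,1 had only occurred on a "throwing" line.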
The logical expression
$2 != "eating" || !seen[$3,$4]++
can be rewritten as
!($2 == "eating" && seen[$3,$4]++)
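(which is how the condition is stated in the question), depending on which form is easier to understand. The two expressions are equivalent.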
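The equivalence follows from De Morgan's law, and in both forms seen[$3,$4]++ is evaluated only when the second field is "eating". One quick way to convince yourself is to run both forms on the same input and compare the results. In the sketch below, sample_data, form_a and form_b are hypothetical helper names, and the input is an excerpt of the question's data.

```shell
# Hypothetical helper that prints a small excerpt of the sample data.
sample_data() {
  printf '%s\n' \
    'a,eating,apple,2' \
    'b,throwing,banana,1' \
    'c,eating,apple,3' \
    'i,eating,apple,2' \
    'k,eating,banana,1'
}

# The two forms of the condition, each reading from standard input.
form_a() { awk -F, '$2 != "eating" || !seen[$3,$4]++'; }
form_b() { awk -F, '!($2 == "eating" && seen[$3,$4]++)'; }

# Compare the two outputs as strings.
if [ "$(sample_data | form_a)" = "$(sample_data | form_b)" ]; then
  echo "outputs are identical"
else
  echo "outputs differ"
fi
```

A matching result on one input is a sanity check rather than a proof, but it is a cheap way to catch a typo when rewriting the condition.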
This is a simple variation of the common awk idiom for removing duplicate lines while preserving the original order of the records:
awk '!seen[$0]++' file
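As a small illustration of that idiom on its own, the sketch below feeds a few made-up lines through it. The wrapper name dedupe and the input words are assumptions chosen for the example.

```shell
# dedupe is a hypothetical wrapper name around the idiom.
dedupe() {
  # seen[$0]++ is 0 (false) the first time a line occurs, so !seen[$0]++
  # is true only for first occurrences, which are then printed.
  awk '!seen[$0]++' "$@"
}

printf '%s\n' apple banana apple cherry banana | dedupe
```

This prints apple, banana and cherry, one per line, in the order in which each first appeared.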