Remove rows that meet a condition and have duplicate values across multiple columns

ddr*_*han 0 awk text-processing

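I need to delete rows conditionally: rows where column 2 is exactly "eating" and where the combination of values in columns 3 and 4 has already appeared in an earlier row.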

My sample CSV data looks like this:

a,eating,apple,2
b,throwing,banana,1
c,eating,apple,3
d,eating,apple,1
e,eating,banana,2
f,throwing,apple,2
g,throwing,banana,2
h,throwing,banana,3
i,eating,apple,2
j,eating,apple,3
k,eating,banana,1
l,throwing,banana,2
m,throwing,banana,1
n,throwing,apple,1
o,eating,apple,3
p,eating,banana,2
q,throwing,apple,1
r,throwing,apple,2
s,eating,apple,1

The output should look like this:

a,eating,apple,2
b,throwing,banana,1
c,eating,apple,3
d,eating,apple,1
e,eating,banana,2
f,throwing,apple,2
g,throwing,banana,2
h,throwing,banana,3
k,eating,banana,1
l,throwing,banana,2
m,throwing,banana,1
n,throwing,apple,1
q,throwing,apple,1
r,throwing,apple,2

Kus*_*nda 5

Assuming the input is "simple CSV", i.e. no field contains an embedded comma or newline, we can use awk like this:

$ awk -F, '$2 != "eating" || !seen[$3,$4]++' file
a,eating,apple,2
b,throwing,banana,1
c,eating,apple,3
d,eating,apple,1
e,eating,banana,2
f,throwing,apple,2
g,throwing,banana,2
h,throwing,banana,3
k,eating,banana,1
l,throwing,banana,2
m,throwing,banana,1
n,throwing,apple,1
q,throwing,apple,1
r,throwing,apple,2

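This prints the current line if the second comma-separated field is not exactly the string eating, or (if the second field is eating) if the combination of the third and fourth fields has not been seen before. Because || short-circuits, seen is only updated for eating rows, so a throwing row never causes a later eating row to be dropped.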
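For readers who find the one-liner dense, the same logic can be spelled out with explicit if statements. This is only a sketch of an equivalent long form, not a replacement for the command above; the five-line sample and the result variable are made up for illustration.

```shell
# Hypothetical five-line sample in the same layout as the question's data.
result=$(printf '%s\n' \
    'a,eating,apple,2' \
    'b,throwing,banana,1' \
    'c,eating,apple,2' \
    'd,eating,banana,1' \
    'e,throwing,banana,1' |
awk -F, '
{
    if ($2 != "eating") {
        # Not an "eating" row: always keep it, and do not record its key.
        print
    } else if (!(($3, $4) in seen)) {
        # First "eating" row with this combination of fields 3 and 4.
        seen[$3, $4] = 1
        print
    }
    # Otherwise this is an "eating" row whose combination was already seen,
    # so nothing is printed.
}')

printf '%s\n' "$result"
```

Here line c is dropped because a already recorded apple,2, while line d is kept even though b has the same banana,1 pair, since b is a throwing row and was never recorded.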

The logical expression

$2 != "eating" || !seen[$3,$4]++

can be rewritten as

!($2 == "eating" && seen[$3,$4]++)

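(which is how the condition is stated in the question), depending on which form you find easier to read. The two expressions are equivalent.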
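One way to convince yourself of the equivalence is to run both forms on the same input and compare the results. The sketch below uses a made-up four-line input; the variable names input, first and second are only for illustration.

```shell
# Hypothetical input: two "eating" rows with the same key, two "throwing" rows.
input=$(printf '%s\n' \
    'a,eating,apple,2' \
    'b,throwing,apple,2' \
    'c,eating,apple,2' \
    'd,throwing,apple,2')

# Form 1: keep non-"eating" rows, or "eating" rows with an unseen key.
first=$(printf '%s\n' "$input" | awk -F, '$2 != "eating" || !seen[$3,$4]++')

# Form 2: drop rows that are "eating" AND have an already seen key.
second=$(printf '%s\n' "$input" | awk -F, '!($2 == "eating" && seen[$3,$4]++)')

# Both forms should produce identical output.
[ "$first" = "$second" ] && echo "same output"
```

In both forms the seen[$3,$4]++ part is skipped for rows whose second field is not eating, which is why they behave identically.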

This is a simple variation on the common awk idiom for removing duplicate lines while preserving the original order of the records:

awk '!seen[$0]++' file
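As a quick illustration of that idiom, the sketch below feeds it a few made-up lines. The expression seen[$0]++ evaluates to the count before incrementing, so it is 0 (false) the first time a line occurs, and the leading ! turns that into a true pattern that triggers awk's default print action.

```shell
# Hypothetical input with repeated lines; only first occurrences survive.
deduped=$(printf '%s\n' pear apple pear fig apple | awk '!seen[$0]++')

printf '%s\n' "$deduped"
# pear
# apple
# fig
```

The answer's command differs only in keying the array on fields 3 and 4 instead of the whole line, and in applying the test to eating rows only.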