Remove duplicates from a CSV based on specified columns

Question

I am working with a CSV dataset that looks like this:

year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK- DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,

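It contains many duplicates that I need to remove, keeping one instance of each record. Following "Remove duplicate entries from a CSV file", I used sort -u file.csv --o deduped-file.csv, which works for examples such as: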

2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,

but it does not catch examples like:

2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,

where the data is incomplete but represents the same thing.
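The reason is that sort -u compares entire lines, so two rows that differ in any field both survive. A quick check using the two rows above (the output file name whole-line.csv is just an illustrative choice):

```shell
# The two rows differ only in the sport field (Soccer vs. empty),
# so a whole-line comparison treats them as distinct.
printf '%s\n' \
  '2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,' \
  '2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,' \
  | sort -u > whole-line.csv

# Both rows are still present in the output.
wc -l < whole-line.csv
```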

Is it possible to remove duplicates based on specified fields (e.g. year, manufacturer, brand, series, variation)?

Answer 1

gle*_*man 6

I would build a "key" from the first 5 fields, then print a line only the first time that key is seen:

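awk -F, '
  # build the key from the first five fields (year through variation)
  {key = $1 FS $2 FS $3 FS $4 FS $5}
  # print the line only the first time this key appears
  !seen[key]++
' file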

year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
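Note that the first row seen for each key is the one that is kept, so if an incomplete row happens to come first, the fuller one is discarded. The following is only a sketch of one possible variation, not part of the answer above: it keeps, for each key, the row with the most non-empty fields, and prints rows in first-seen order. The file names cards.csv and deduped.csv and the array names best, count and order are hypothetical, and like the original it assumes no quoted fields containing commas.

```shell
# Hypothetical sample: rows taken from the question, with the
# incomplete John Amoth row deliberately placed first.
cat > cards.csv <<'EOF'
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
EOF

awk -F, '
  {
    key = $1 FS $2 FS $3 FS $4 FS $5

    # count the non-empty fields in this row
    filled = 0
    for (i = 1; i <= NF; i++) if ($i != "") filled++

    # remember the order in which keys first appear
    if (!(key in best)) order[++n] = key

    # keep this row if the key is new or this row is more complete
    if (!(key in best) || filled > count[key]) {
      best[key] = $0
      count[key] = filled
    }
  }
  END { for (i = 1; i <= n; i++) print best[order[i]] }
' cards.csv > deduped.csv

cat deduped.csv
```

With this sample, the John Amoth row that includes Soccer is the one written to deduped.csv, even though the incomplete row appears first in the input. On ties, the earlier row is kept.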
