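What is the most efficient way to find the lines of a CSV file that contain duplicate entries among the fields on that line (not counting blanks)?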

Question

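I want to find all lines of a CSV file in which the same data appears in two or more fields of that line (that is, all lines whose fields do not all hold unique data).

For example, I have the following CSV file: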

John,Smith,Smith,21
Mary,Jones,Smith,32
John,42,42,42
Henry,Brown,Jones,31
Mary,,,21

I would like the following lines to be printed:

John,Smith,Smith,21
John,42,42,42

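These lines are printed because the data in one of their fields also appears in another field of the same line. Note that "Mary,,,21" is not printed, even though it has repeated empty fields.

I could write a Python script and count how many times each entry occurs on each line, but it seems there must be a better way to do this.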
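
For reference, the counting approach mentioned above could look roughly like the following. This is only a minimal sketch: the names `SAMPLE` and `has_repeated_field` are made up for illustration, and the plain `split(",")` assumes there are no quoted fields or embedded commas.

```python
from collections import Counter

# The sample data from the question.
SAMPLE = """John,Smith,Smith,21
Mary,Jones,Smith,32
John,42,42,42
Henry,Brown,Jones,31
Mary,,,21"""


def has_repeated_field(line):
    """Return True if any non-empty field value occurs more than once on the line."""
    # Naive comma split: assumes no quoted fields or embedded commas.
    counts = Counter(field for field in line.split(",") if field)
    return any(n > 1 for n in counts.values())


matching_lines = [line for line in SAMPLE.splitlines() if has_repeated_field(line)]
print("\n".join(matching_lines))
```

Empty fields are filtered out before counting, so the repeated blanks in "Mary,,,21" are never counted.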

Answer 1

Sob*_*que 5

Using perl:

perl -F, -lane 'my %s; print if grep { $s{$_}++ } @F'

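This uses:

-F, sets the field separator to ,
-l handles line endings automatically
-a autosplits each line
-n wraps the code in a while ( <> ) { loop
-e specifies the code to execute

The incoming data is split automatically into @F, and we use the %s hash to spot whether a field value has already been seen on that line (a dupe).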

If, based on your comment, you need to skip empty fields (which would otherwise be counted as dupes):

perl -F, -lane 'my %s; print if grep { /./ ? $s{$_}++ : () } @F'

This adds a ternary operator that tests whether the field is empty; empty fields are skipped and never counted.
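
For readers more at home in Python (which the question mentions), the same "seen" idea can be sketched as below. It is an illustrative equivalent, not the code from this answer: the function name `has_duplicate` and the `skip_empty` flag are hypothetical, and a set stands in for the %s hash.

```python
def has_duplicate(fields, skip_empty=True):
    """Return True if some value in `fields` has already appeared earlier in the list."""
    seen = set()  # plays the role of the %s hash in the one-liner
    for field in fields:
        if skip_empty and field == "":
            continue  # empty fields are ignored, like the /./ test
        if field in seen:
            return True  # second occurrence of the same value
        seen.add(field)
    return False


print(has_duplicate("John,Smith,Smith,21".split(",")))
print(has_duplicate("Mary,,,21".split(",")))
print(has_duplicate("Mary,,,21".split(","), skip_empty=False))
```

With `skip_empty=False` the function behaves like the first one-liner, so the repeated blanks in "Mary,,,21" count as a duplicate; with the default they do not.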

Tested on Windows (hence the different quoting):

C:\Users\me>perl -F, -lane "my %s; print qq{line matches:$_} if grep { /./ ? $s{$_}++ : () } @F"
line matches:John,Smith,Smith,21
line matches:John,42,42,42

Written out in full, it looks more like this:

#!/usr/bin/env perl
use strict;
use warnings;

while ( my $line = <DATA> ) {
   my %seen;    # reset for every line
   chomp($line);
   my @fields = split /,/, $line;
   # /./ skips empty fields; $seen{$_}++ is true from the second occurrence on
   if ( grep { /./ and $seen{$_}++ } @fields ) {
       print $line,"\n";
   }
}

__DATA__
John,Smith,Smith,21
Mary,Jones,Smith,32
John,42,42,42
Henry,Brown,Jones,31
Mary,,,21

You could parse it with the Text::CSV module, but I would suggest that is not necessary unless you specifically have to handle quoting, embedded newlines and the like.

For example:

#!/usr/bin/env perl
use strict;
use warnings;

use Data::Dumper;
use Text::CSV; 

my $csv = Text::CSV -> new ( {sep_char => ',', eol => "\n", binary => 1} ); 

while ( my $row = $csv -> getline ( \*DATA ) ) {
   my %seen; 
   if ( grep { /./ and $seen{$_}++ } @$row ) { 
       print join( ",", @$row ), "\n";    # keep "\n" out of the join list to avoid a trailing comma
   }
}

__DATA__
John,Smith,Smith,21
Mary,Jones,Smith,32
John,42,42,42
Henry,Brown,Jones,31
Mary,,,21
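
To illustrate when a real CSV parser matters, here is a hedged Python sketch using the standard `csv` module rather than Text::CSV. The sample data is hypothetical and was chosen so that a naive comma split would give the wrong answer on both quoted lines; `rows_with_repeats` and `QUOTED_SAMPLE` are names invented for this example.

```python
import csv
import io

# Hypothetical sample with quoted fields (not from the question):
#   line 1: "Smith" and Smith are the same value once the quotes are parsed
#   line 2: the quoted field contains commas, so Smith is NOT repeated
QUOTED_SAMPLE = '''"Smith",Smith,21
"Jones,Smith,Brown",Smith,21
Mary,,,21
'''


def rows_with_repeats(stream):
    """Yield parsed rows in which a non-empty value appears in more than one field."""
    for row in csv.reader(stream):
        non_empty = [field for field in row if field]
        # A set drops repeats, so a shorter set means some value occurred twice.
        if len(set(non_empty)) < len(non_empty):
            yield row


repeated_rows = list(rows_with_repeats(io.StringIO(QUOTED_SAMPLE)))
print(repeated_rows)
```

Splitting on commas alone would miss the first line (the quote characters make the two values differ) and wrongly flag the second (the Smith inside the quoted field would look like its own field), which is the kind of input a CSV parser is meant to handle.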
