Processing some poorly delimited data into a useful CSV

Question

Pau*_*aul 13 sed awk shell-script text-processing csv

I have some output of the following form:

count  id     type
588    10 |    3
 10    12 |    3
883    14 |    3
 98    17 |    3
 17    18 |    1
77598    18 |    3
10000    21 |    3
17892     2 |    3
20000    23 |    3
 63    27 |    3
  6     3 |    3
 2446    35 |    3
 14    4 |    3
 15     4 |    1
253     4 |    2
19857     4 |    3
 1000     5 |    3
...

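This is pretty messy and needs to be cleaned up into a CSV file so that I can hand it to a project manager and let them spreadsheet the hell out of it.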

The core of the problem is this: I need the output to be:

id, sum_of_type_1, sum_of_type_2, sum_of_type_3

An example is id "4":

14    4 |    3
 15     4 |    1
253     4 |    2
19857     4 |    3

This should instead be:

4,15,253,19871

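Unfortunately I'm pretty rubbish at this sort of thing. I've managed to get all the lines cleaned up and into CSV, but I haven't been able to de-duplicate and group the rows. Right now I have this: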

awk 'BEGIN{OFS=",";} {split($line, part, " "); print part[1],part[2],part[4]}' | awk '{ gsub (" ", "", $0); print}'

But all that does is clean up the junk characters and print the rows again.

What is the best way to massage the rows into the output described above?

Answer 1

小智 12

One way is to put everything into a hash.

# put values into a hash based on the id and tag
awk 'NR>1{n[$2","$4]+=$1}
END{
    # merge the same ids on the one line
    for(i in n){
        id=i;
        sub(/,.*/,"",id);
        a[id]=a[id]","n[i];
    }
    # print everything
    for(i in a){
        print i""a[i];
    }
}'

Edit: my first answer did not correctly answer the question.
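
One thing to watch with the script above: awk visits `for (i in n)` keys in an unspecified order, and an id that has no rows for some type ends up with fewer columns, so the sums may not line up with `sum_of_type_1, sum_of_type_2, sum_of_type_3`. A minimal sketch of a variant that always emits three columns, assuming the types are exactly 1, 2 and 3 and the input starts with the header line shown in the question (the function name `sum_by_type` is hypothetical, not part of the answer):

```shell
# Hypothetical helper: reads the "count id | type" table on stdin and
# writes one "id,type1,type2,type3" line per id, sorted numerically by id.
sum_by_type() {
    awk '
        NR > 1 && NF >= 4 {          # skip the header; ignore short lines such as "..."
            sum[$2, $4] += $1        # key is the (id, type) pair
            ids[$2] = 1              # remember every id we have seen
        }
        END {
            for (id in ids)
                # +0 turns a missing (id, type) entry into 0
                printf "%s,%d,%d,%d\n", id, sum[id, 1] + 0, sum[id, 2] + 0, sum[id, 3] + 0
        }
    ' | sort -t, -k1,1n              # for-in order is unspecified, so sort by id afterwards
}

# Usage on a few rows taken from the question.
out=$(sum_by_type <<'EOF'
count  id     type
 14    4 |    3
 15     4 |    1
253     4 |    2
19857     4 |    3
 1000     5 |    3
EOF
)
printf '%s\n' "$out"
```

Hard-coding the three types keeps the script short; if the set of types is not known in advance, they would have to be collected while reading, as the Perl answer does.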

Answer 2

cho*_*oba 11

Perl to the rescue:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

<>;  # Skip the header.

my %sum;
my %types;
while (<>) {
    my ($count, $id, $type) = grep length, split '[\s|]+';
    $sum{$id}{$type} += $count;
    $types{$type} = 1;
}

say join ',', 'id', sort keys %types;
for my $id (sort { $a <=> $b } keys %sum) {
    say join ',', $id, map $_ // q(), @{ $sum{$id} }{ sort keys %types };
}

It keeps two tables, a table of types and a table of ids. For each id, it stores the sum for each type.
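
For readers who would rather stay in awk, the same two-table idea can be sketched roughly as follows. This is an illustration under assumptions, not a translation of the Perl above: it prints 0 instead of an empty field for a missing type, it sorts both ids and types numerically with a small hand-written insertion sort (POSIX awk has no built-in sort), and the names `crosstab` and `sort_num` are hypothetical.

```shell
# Hypothetical helper: builds an id-by-type table of summed counts from stdin.
crosstab() {
    awk '
        NR > 1 && NF >= 4 {                       # skip the header and short lines
            sum[$2, $4] += $1                     # per-(id, type) totals
            if (!($2 in ids))   { ids[$2] = 1;   idlist[++ni] = $2 }
            if (!($4 in types)) { types[$4] = 1; typelist[++nt] = $4 }
        }
        END {
            sort_num(idlist, ni)
            sort_num(typelist, nt)
            line = "id"                           # header: id followed by each type seen
            for (t = 1; t <= nt; t++) line = line "," typelist[t]
            print line
            for (i = 1; i <= ni; i++) {           # one row per id, one column per type
                line = idlist[i]
                for (t = 1; t <= nt; t++)
                    line = line "," (sum[idlist[i], typelist[t]] + 0)
                print line
            }
        }
        # insertion sort of a[1..n], numeric, ascending
        function sort_num(a, n,    i, j, tmp) {
            for (i = 2; i <= n; i++) {
                tmp = a[i]
                for (j = i - 1; j >= 1 && a[j] + 0 > tmp + 0; j--) a[j + 1] = a[j]
                a[j + 1] = tmp
            }
        }
    '
}

# Usage on a few rows taken from the question.
table=$(crosstab <<'EOF'
count  id     type
 17    18 |    1
77598    18 |    3
 14    4 |    3
 15     4 |    1
253     4 |    2
19857     4 |    3
EOF
)
printf '%s\n' "$table"
```

Keeping a separate list of types is what lets every row have the same columns, even for ids that never appear with a given type.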

Answer 3

ste*_*ver 5

If GNU datamash is an option for you, then

awk 'NR>1 {print $1, $2, $4}' OFS=, file | datamash -t, -s --filler=0 crosstab 2,3 sum 1
,1,2,3
10,0,0,588
12,0,0,10
14,0,0,883
17,0,0,98
18,17,0,77598
2,0,0,17892
21,0,0,10000
23,0,0,20000
27,0,0,63
3,0,0,6
35,0,0,2446
4,15,253,19871
5,0,0,1000
