Summarizing sentences

7 shell sed awk text-processing

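I have data, and I want to summarize sentences to draw a conclusion. The example below is unrelated to my actual data; it is only meant to illustrate the idea so that I can replicate it.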

Employee Suzie signed one time.
Employee Dan signed one time.
Employee Jordan signed one time.
Employee Suzie signed one time.
Employee Suzie signed one time.
Employee Harold signed one time.
Employee Sebastian signed one time.
Employee Jordan signed one time.
Employee Suzie signed one time.
Employee Suzan signed one time.

I would like to produce a summary of these sentences, like this:

Jordan signed 2 time(s)
Dan signed 1 time(s)
Suzie signed 4 time(s)
Suzan signed 1 time(s)
Sebastian signed 1 time(s)
Harold signed 1 time(s)

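I played around with awk, but it seemed too hard to get right. Then I tried sed, but that did not work either. It seems sed is only meant for finding and changing things.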

Kus*_*nda 14

The general approach would be:

$ awk '{ count[$2]++ }
       END {
           for (name in count)
               printf("%s signed %d time(s)\n", name, count[name])
       }' <file
Harold signed 1 time(s)
Dan signed 1 time(s)
Sebastian signed 1 time(s)
Suzie signed 4 time(s)
Jordan signed 2 time(s)
Suzan signed 1 time(s)

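That is, use an associative array (hash) to store the number of times each name occurs. In the END block, iterate over all the names and print a summary for each one.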
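Note that awk does not specify the order in which `for (name in count)` visits the array, which is why the output above is in a seemingly arbitrary order. If a stable order matters, one option is to pipe the result through `sort`. The following is a minimal sketch, not part of the original answer; the function name `summarize_sorted` and the variable `data` are made up for illustration, and the data is the sample from the question.

```shell
# Sample data from the question, held in a variable (hypothetical name).
data='Employee Suzie signed one time.
Employee Dan signed one time.
Employee Jordan signed one time.
Employee Suzie signed one time.
Employee Suzie signed one time.
Employee Harold signed one time.
Employee Sebastian signed one time.
Employee Jordan signed one time.
Employee Suzie signed one time.
Employee Suzan signed one time.'

# Count occurrences of field 2 (the name), then sort the summary lines:
# first by the count in field 3 (numeric, descending), then by name.
summarize_sorted() {
    awk '{ count[$2]++ }
         END {
             for (name in count)
                 printf("%s signed %d time(s)\n", name, count[name])
         }' | sort -k3,3nr -k1,1
}

printf '%s\n' "$data" | summarize_sorted
# Suzie signed 4 time(s)
# Jordan signed 2 time(s)
# Dan signed 1 time(s)
# Harold signed 1 time(s)
# Sebastian signed 1 time(s)
# Suzan signed 1 time(s)
```

Sorting after the fact keeps the awk program unchanged and works with any POSIX awk.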

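For slightly nicer formatting, change the %s placeholder in the printf() call to something like %-10s, which reserves 10 characters for the name (left-justified).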

$ awk '{ count[$2]++ }
       END {
           for (name in count)
               printf("%-10s signed %d time(s)\n", name, count[name])
       }' <file
Harold     signed 1 time(s)
Dan        signed 1 time(s)
Sebastian  signed 1 time(s)
Suzie      signed 4 time(s)
Jordan     signed 2 time(s)
Suzan      signed 1 time(s)

Fiddling a bit more with the output (because I'm bored):

$ awk '{ count[$2]++ }
       END {
           for (name in count)
               printf("%-10s signed %d time%s\n", name, count[name],
                      count[name] > 1 ? "s" : "" )
       }' <file
Harold     signed 1 time
Dan        signed 1 time
Sebastian  signed 1 time
Suzie      signed 4 times
Jordan     signed 2 times
Suzan      signed 1 time


αғs*_*нιη 8

Since awk uses an associative array here, and is therefore limited by the amount of memory you have, you could instead do:

sort -k2,2 infile | uniq -c

Or, formatted as requested:

sort -k2,2 infile  |uniq -c |awk '{ print $3, "signed", $1, "time(s)" }'
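One caveat: `uniq -c` compares whole lines, so the pipeline above relies on every line for a given name being identical, as in the sample. If the rest of the line could differ between entries, a possible variation is to extract just the name before sorting and counting. This is a sketch under that assumption, not part of the original answer; `count_names` and `data` are hypothetical names, and the data is the sample from the question.

```shell
# Sample data from the question, held in a variable (hypothetical name).
data='Employee Suzie signed one time.
Employee Dan signed one time.
Employee Jordan signed one time.
Employee Suzie signed one time.
Employee Suzie signed one time.
Employee Harold signed one time.
Employee Sebastian signed one time.
Employee Jordan signed one time.
Employee Suzie signed one time.
Employee Suzan signed one time.'

# Keep only field 2 (the name), sort so equal names are adjacent,
# count each run with uniq -c, then reformat "count name" lines.
count_names() {
    awk '{ print $2 }' | sort | uniq -c |
        awk '{ print $2, "signed", $1, "time(s)" }'
}

printf '%s\n' "$data" | count_names
# Dan signed 1 time(s)
# Harold signed 1 time(s)
# Jordan signed 2 time(s)
# Sebastian signed 1 time(s)
# Suzan signed 1 time(s)
# Suzie signed 4 time(s)
```

Because only the names reach `uniq -c`, the fields in the final awk step shift by one: the count is `$1` and the name is `$2`.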