I have a text file containing lines like the following:
This is a thread 139737522087680
This is a thread 139737513694976
This is a thread 139737505302272
This is a thread 139737312270080
.
.
.
This is a thread 139737203164928
This is a thread 139737194772224
This is a thread 139737186379520
How can I determine whether every line is unique?
Note: the goal is to test the file, not to modify it if duplicate lines are present.
iru*_*var 25
An awk solution:
awk 'a[$0]++{print "dupes"; exit(1)}' file && echo "no dupes"
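The one-liner works because `a[$0]++` is zero (false) the first time a line is seen and non-zero on every repeat. A minimal sketch of the same idea that also reports where the first repeat occurs; the helper name `check_unique` and the temporary demo file are assumptions made for illustration:

```shell
# check_unique FILE: hypothetical helper; exit 0 if no line in FILE repeats.
check_unique() {
    # seen[$0]++ is 0 on the first sighting of a line, so the action runs only on a repeat.
    awk 'seen[$0]++ { printf "duplicate at line %d: %s\n", NR, $0; exit 1 }' "$1"
}

# Demo on a throwaway file whose third line repeats the first.
tmp=$(mktemp)
printf 'a\nb\na\n' > "$tmp"
check_unique "$tmp" && echo "no dupes"
rm -f "$tmp"
```

Like the original, this stops at the first repeated line instead of reading the rest of the file.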
Jef*_*ler 24
[ "$(wc -l < input)" -eq "$(sort -u input | wc -l)" ] && echo all unique
jes*_*e_b 22
Using sort/uniq:
sort input.txt | uniq
To check only for duplicated lines, use the -d option of uniq. This prints only the lines that are duplicated, and nothing if there are none:
sort input.txt | uniq -d
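Since the goal is a yes/no test rather than a listing, the output of `uniq -d` can be turned into an exit status. A minimal sketch, assuming a hypothetical helper named `all_unique` and a temporary demo file:

```shell
# all_unique FILE: hypothetical helper; exit 0 if no line in FILE occurs more than once.
all_unique() {
    # uniq -d prints one copy of each repeated line.
    # grep -q '^' succeeds if at least one line (even an empty one) arrives,
    # and the leading ! inverts that so "no output" means success.
    ! sort "$1" | uniq -d | grep -q '^'
}

# Demo on a throwaway file that contains a repeated line.
tmp=$(mktemp)
printf 'a\nb\na\n' > "$tmp"
all_unique "$tmp" && echo "unique" || echo "duplicates"
rm -f "$tmp"
```

The pattern `'^'` is used instead of `.` so that a repeated blank line is still counted as a duplicate.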
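The original question was unclear and read as though the OP simply wanted a unique version of the file's contents. That is shown below. In the since-updated form of the question, the OP now states that he/she simply wants to know whether the contents of the file are unique.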
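You can simply use sort to verify whether a file is unique or contains duplicates. Note that -C with -u checks that the input is already sorted and free of duplicates, so a file whose lines are unique but out of order also takes the "duplicates" branch: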
$ sort -uC input.txt && echo "unique" || echo "duplicates"
Say I have these two files:
Duplicate sample file
$ cat dup_input.txt
This is a thread 139737522087680
This is a thread 139737513694976
This is a thread 139737505302272
This is a thread 139737312270080
This is a thread 139737203164928
This is a thread 139737194772224
This is a thread 139737186379520
Unique sample file
$ cat uniq_input.txt
A
B
C
D
Now when we analyze these files we can tell whether they are unique or contain duplicates:
Testing the duplicate file
$ sort -uC dup_input.txt && echo "unique" || echo "duplicates"
duplicates
Testing the unique file
$ sort -uC uniq_input.txt && echo "unique" || echo "duplicates"
unique
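Because `-C` also verifies ordering, the check above depends on the order of the lines as well as on their uniqueness. One way around that, as a sketch under assumptions rather than a definitive fix, is to sort first and run the check on the sorted stream. The helper name `is_unique_any_order` and the temporary demo file are hypothetical, and `LC_ALL=C` is assumed here so that lines are compared byte for byte:

```shell
# is_unique_any_order FILE: hypothetical helper; exit 0 if FILE has no repeated line,
# regardless of the order of its lines.
is_unique_any_order() {
    # Sort first, so the strict-order check can only fail on equal neighbouring lines.
    LC_ALL=C sort "$1" | LC_ALL=C sort -uC
}

# Demo on a throwaway file: every line is unique, but the file is in reverse order.
tmp=$(mktemp)
printf 'c\nb\na\n' > "$tmp"
sort -uC "$tmp" && echo "unique" || echo "duplicates"
is_unique_any_order "$tmp" && echo "unique" || echo "duplicates"
rm -f "$tmp"
```

On this demo file the plain `sort -uC` check prints "duplicates" because the lines are out of order, while the sort-first version prints "unique".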
Using just sort:
$ sort -u input.txt
This is a thread 139737186379520
This is a thread 139737194772224
This is a thread 139737203164928
This is a thread 139737312270080
This is a thread 139737505302272
This is a thread 139737513694976
This is a thread 139737522087680