ari*_*nos 0 windows bash awk cygwin batch-file
I have a text file that is over 50GB. It contains many lines, each around 15 characters long on average. I want every line to be unique (case-sensitive): if a line is exactly the same as another one, it must be removed, without changing the order of the remaining lines or sorting the file in any way.
My question is different from others because my file is too large for the other solutions I found when searching.
I have tried:
awk !seen[$0]++ bigtextfile.txt > dublicatesremoved.txt
It starts off well and fast, but before long I get the following error:
awk: (FILENAME=bigtextfile.txt FNR=19083509) fatal: more_nodes: nextfree: can't allocate 4000 bytes of memory (Not enough space)
The error above occurs when the output file is around 200MB.
Is there another fast way to do the same thing on Windows?
You can do this on a UNIX machine, or with Cygwin on top of Windows:
$ cat file
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.

Loud the winds howl, loud the waves roar,
Speed, bonnie boat, like a bird on the wing,
Thunderclaps rend the air;
Onward! the sailors cry;
Baffled, our foes stand by the shore,
Carry the lad that's born to be King
Follow they will not dare.
Over the sea to Skye.
.
$ cat -n file | sort -k2 -u | sort -n | cut -f2-
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.

Loud the winds howl, loud the waves roar,
Thunderclaps rend the air;
Baffled, our foes stand by the shore,
Follow they will not dare.
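The only command above that tries to handle the whole file at once is sort, and sort is designed to use paging etc. to handle files too large to fit in memory (see https://unix.stackexchange.com/q/279096/133219), so IMHO it's your best shot at being able to do this.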
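For a 50GB input it may be worth controlling where sort spills its temporary files and how much memory it uses. The following is a minimal sketch, assuming GNU sort (the version shipped with Cygwin and most Linux systems), whose -S and -T options set the buffer size and the temporary directory. The function name dedupe_keep_order, the 1G default, and the argument layout are illustrative assumptions, not part of the original answer. LC_ALL=C is added so that lines are compared byte by byte, which matches the requirement of exact, case-sensitive uniqueness:

```shell
#!/bin/sh
# Sketch only: assumes GNU sort (Cygwin, most Linux systems).
# dedupe_keep_order is a hypothetical helper name, not from the original answer.
# Usage: dedupe_keep_order INPUT [TMPDIR] [BUFFER_SIZE]
dedupe_keep_order() (
    infile=$1
    tmpdir=${2:-${TMPDIR:-/tmp}}   # where sort writes its temporary chunks
    bufsize=${3:-1G}               # memory sort may use before spilling to disk
    tab=$(printf '\t')             # cat -n puts a tab between number and text

    export LC_ALL=C                # byte-wise comparison: exact, case-sensitive

    # 1. number the lines, 2. keep one line per distinct content,
    # 3. restore the original order by line number, 4. drop the numbers.
    cat -n "$infile" |
        sort -t "$tab" -k2 -u -S "$bufsize" -T "$tmpdir" |
        sort -t "$tab" -k1,1n -S "$bufsize" -T "$tmpdir" |
        cut -f2-
)
```

It could be called as dedupe_keep_order bigtextfile.txt /cygdrive/d/tmp 4G > dublicatesremoved.txt, where /cygdrive/d/tmp is only a placeholder path. The temporary directory will likely need free space at least comparable to the size of the input, since the numbered lines are longer than the originals.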
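Start with cat -n file and then add each command to the pipeline one at a time to see what it does (see below). It just adds line numbers so that we can sort uniquely by content to get the unique lines, then sort by the original line numbers to restore the original line order, and finally remove the line numbers that were added in the first step: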
$ cat -n file
1 Speed, bonnie boat, like a bird on the wing,
2 Onward! the sailors cry;
3 Carry the lad that's born to be King
4 Over the sea to Skye.
5
6 Loud the winds howl, loud the waves roar,
7 Speed, bonnie boat, like a bird on the wing,
8 Thunderclaps rend the air;
9 Onward! the sailors cry;
10 Baffled, our foes stand by the shore,
11 Carry the lad that's born to be King
12 Follow they will not dare.
13 Over the sea to Skye.
14
.
$ cat -n file | sort -k2 -u
5
10 Baffled, our foes stand by the shore,
3 Carry the lad that's born to be King
12 Follow they will not dare.
6 Loud the winds howl, loud the waves roar,
2 Onward! the sailors cry;
4 Over the sea to Skye.
1 Speed, bonnie boat, like a bird on the wing,
8 Thunderclaps rend the air;
.
$ cat -n file | sort -k2 -u | sort -n
1 Speed, bonnie boat, like a bird on the wing,
2 Onward! the sailors cry;
3 Carry the lad that's born to be King
4 Over the sea to Skye.
5
6 Loud the winds howl, loud the waves roar,
8 Thunderclaps rend the air;
10 Baffled, our foes stand by the shore,
12 Follow they will not dare.
.
$ cat -n file | sort -k2 -u | sort -n | cut -f2-
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.

Loud the winds howl, loud the waves roar,
Thunderclaps rend the air;
Baffled, our foes stand by the shore,
Follow they will not dare.
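As a quick sanity check on the result, one could confirm that the output has no repeated lines and that no distinct line was lost. This is a rough sketch using the same sort tool plus wc; the helper names count_distinct and check_dedup are made up for illustration. The check sorts both files again, so it is not cheap on a 50GB input, and it does not confirm that the original order was preserved:

```shell
#!/bin/sh
# Sketch only: sanity checks for an order-preserving de-duplication.
# count_distinct and check_dedup are hypothetical helper names.

# Print the number of distinct lines in a file (byte-wise comparison).
count_distinct() {
    LC_ALL=C sort -u "$1" | wc -l
}

# Succeed if DEDUPED has no repeated lines and has exactly as many
# distinct lines as ORIGINAL. Line order is not checked.
# Usage: check_dedup ORIGINAL DEDUPED
check_dedup() (
    original=$1
    deduped=$2
    # $(( ... )) normalises any padding that wc adds around the count
    total=$(($(wc -l < "$deduped")))
    distinct_out=$(($(count_distinct "$deduped")))
    distinct_in=$(($(count_distinct "$original")))
    [ "$total" -eq "$distinct_out" ] && [ "$distinct_out" -eq "$distinct_in" ]
)
```

For example, check_dedup bigtextfile.txt dublicatesremoved.txt && echo OK would print OK only when both conditions hold.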