Pre*_*sić 7 sorting powershell text large-files
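I have standard Apache log files, between 500 MB and 2 GB in size. I need to sort the lines in them (each line starts with a date in the form yyyy-MM-dd hh:mm:ss, so no extra processing is needed before sorting).
The simplest and most obvious thing that comes to mind is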
Get-Content unsorted.txt | sort | get-unique > sorted.txt
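I am guessing (without having tried it) that doing this with Get-Content will take forever on my 1 GB files. I don't really know my way around System.IO.StreamReader, but I'm curious whether an efficient solution could be put together using it?
Thanks to anyone who might have a more efficient idea.
[Edit]
I tried this afterwards, and it took a very long time: about 10 minutes for 400 MB.
Get-Content is very inefficient for reading large files. Sort-Object is not very fast either.
Let's set up a baseline: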
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
With a 40 MB file containing 1.6 million lines (100k unique lines repeated 16 times), this script produces the following output on my machine:
Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663
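Thoroughly unimpressive: more than 6 minutes to sort a small file. Every step can be improved a lot. Let's use a StreamReader to read the file line by line into a HashSet, which removes duplicates, then copy the data into a List and sort it there, and finally use a StreamWriter to dump the results.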
$hs = new-object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        # Add() returns $false for a duplicate; assigning to $t keeps that value out of the output.
        $t = $hs.Add($line)
    }
}
finally {
    $reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = new-object system.collections.generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
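# Open the writer before the try block so that finally never calls Close() on a writer that failed to open.
$f = New-Object System.IO.StreamWriter "d:\result2.txt";
try
{
    foreach ($s in $ls)
    {
        $f.WriteLine($s);
    }
}
finally
{
    $f.Close();
}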
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
This script produces:
read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802
On the same input file, it runs more than 10 times faster. I am still surprised, though, that reading the file from disk takes 30 seconds.
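For reuse, the same read, de-duplicate, sort, write pipeline could be wrapped in a function. The following is only a sketch under some assumptions: the function name Sort-UniqueLines and its parameter names are made up for illustration, the `::new()` syntax assumes PowerShell 5.0 or later, and the switch to an ordinal comparer is my own substitution that was not part of the timings above. Ordinal comparison orders lines by raw character codes, which gives the expected order for lines that begin with a yyyy-MM-dd hh:mm:ss timestamp, but it can order other text differently from the culture-aware default. I have not measured whether it changes the run time.

```powershell
# Hypothetical helper, not part of the measured scripts above.
function Sort-UniqueLines {
    param(
        [Parameter(Mandatory)] [string] $InPath,
        [Parameter(Mandatory)] [string] $OutPath
    )

    # Ordinal comparer: compares raw character codes, no culture rules.
    $ordinal = [System.StringComparer]::Ordinal

    # The HashSet drops duplicate lines as they are read.
    $seen = [System.Collections.Generic.HashSet[string]]::new($ordinal)
    foreach ($line in [System.IO.File]::ReadLines($InPath)) {
        [void] $seen.Add($line)
    }

    # Copy the unique lines into a List and sort it in place.
    $lines = [System.Collections.Generic.List[string]]::new($seen.Count)
    $lines.AddRange($seen)
    $lines.Sort($ordinal)

    # Write the sorted lines back out, one per line.
    [System.IO.File]::WriteAllLines($OutPath, $lines)
}
```

It would be called as `Sort-UniqueLines -InPath 'D:\log3.txt' -OutPath 'D:\result3.txt'`. Full paths are safer here, because the .NET file APIs resolve relative paths against the process working directory, which is not necessarily the current PowerShell location.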