fel*_*xmc 3 csv powershell performance if-statement
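I have a working script whose purpose is to parse data files for malformed rows before importing them into Oracle. Processing a 450MB csv file with more than 1 million rows and 8 columns takes a little over 2.5 hours and maxes out a single CPU core. Small files complete quickly (in seconds).
Oddly, a 350MB file with a similar number of rows and 40 columns takes only 30 minutes.
My issue is that the files will grow over time, and tying up a CPU for 2.5 hours isn't good. Can anyone recommend code optimisations? A similarly titled post recommended local paths - which I'm already doing.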
$file = "\Your.csv"
$path = "C:\Folder"
$csv = Get-Content "$path$file"
# Count number of file headers
$count = ($csv[0] -split ',').count
# https://blogs.technet.microsoft.com/gbordier/2009/05/05/powershell-and-writing-files-how-fast-can-you-write-to-a-file/
$stream1 = [System.IO.StreamWriter] "$path\Passed$file-Pass.txt"
$stream2 = [System.IO.StreamWriter] "$path\Failed$file-Fail.txt"
# 2 validation steps: (1) count number of headers is ge (2) Row split after first col. Those right hand side cols must total at least 40 characters.
$csv | Select -Skip 1 | % {
    if( ($_ -split ',').count -ge $count -And ($_.split(',',2)[1]).Length -ge 40) {
        $stream1.WriteLine($_)
    } else {
        $stream2.WriteLine($_)
    }
}
$stream1.close()
$stream2.close()
Sample data file:
C1,C2,C3,C4,C5,C6,C7,C8
ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10
ABC,000000000000006732,1110,2016-06-22,0,P,ESTIMATE,2015473497A10
ABC,,2016-06-22,,201501
,,,,,,,,
ABC,000000000000006732,1135,2016-08-28,0,P,ESTIMATE,2015473497B10
ABC,000000000000006732,1167,2015-12-20,0,P,ESTIMATE,2015473497B10
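Get-Content is extremely slow in its default mode, which produces an array, when the file contains millions of lines. This holds on all PowerShell versions, including 5.1. What's worse, you're assigning the result to a variable, so nothing else happens until the entire file has been read and split into lines. On an Intel i7 3770K CPU at 3.9GHz, $csv = Get-Content $path takes more than 2 minutes to read a 350MB file with 8 million lines.
Solution: use IO.StreamReader to read one line and process it immediately.
In PowerShell 2, StreamReader is less optimized than in PS3+, but it is still faster than Get-Content.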
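To check the difference on your own machine, a rough timing sketch along the following lines may help. It is not part of the original measurements: the temp file name, the 100,000-row size, and the variable names are illustrative assumptions, and the absolute timings will depend on the PowerShell version and the disk.

```powershell
# Illustrative benchmark only: file name and row count are assumptions.
$tmp  = Join-Path ([IO.Path]::GetTempPath()) 'gc-vs-streamreader.csv'
$rows = 100000

# Generate a throwaway CSV with a header plus $rows data lines.
$w = New-Object IO.StreamWriter ($tmp, $false, [Text.Encoding]::UTF8)
$w.WriteLine('C1,C2,C3')
for ($i = 0; $i -lt $rows; $i++) { $w.WriteLine("ABC,$i,2016-02-20") }
$w.Close()

# Default Get-Content: the whole file becomes an array before anything else runs.
$watch = [Diagnostics.Stopwatch]::StartNew()
$lines = Get-Content $tmp
$msGetContent = $watch.ElapsedMilliseconds

# StreamReader: one line at a time, nothing is kept in memory.
$watch = [Diagnostics.Stopwatch]::StartNew()
$n = 0
$r = New-Object IO.StreamReader ($tmp, [Text.Encoding]::UTF8)
while ($null -ne $r.ReadLine()) { $n++ }
$r.Close()
$msStreamReader = $watch.ElapsedMilliseconds

"Get-Content : $msGetContent ms for $($lines.Count) lines"
"StreamReader: $msStreamReader ms for $n lines"

Remove-Item $tmp
```

Both approaches should report the same line count; only the elapsed time is expected to differ.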
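Pipelining via | is at least several times slower than direct enumeration with flow-control statements such as while or foreach (the statement, not the cmdlet), so use the statements.
Use the IndexOf and Replace methods (not the operators) to count character occurrences instead of splitting each line into an array.
Use the Invoke-Command { } trick so the loop does not run through an internal pipeline!
The code below is PS2-compatible. It runs faster on PS3+ (30 seconds for 8 million lines in a 350MB csv on my PC).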
$reader = New-Object IO.StreamReader ('r:\data.csv', [Text.Encoding]::UTF8, $true, 4MB)
$header = $reader.ReadLine()
$numCol = $header.Split(',').count
$writer1 = New-Object IO.StreamWriter ('r:\1.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer2 = New-Object IO.StreamWriter ('r:\2.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer1.WriteLine($header)
$writer2.WriteLine($header)
Write-Progress 'Filtering...' -status ' '
$watch = [Diagnostics.Stopwatch]::StartNew()
$currLine = 0
Invoke-Command { # the speed-up trick: disables internal pipeline
    while (!$reader.EndOfStream) {
        $s = $reader.ReadLine()
        $slen = $s.length
        if ($slen-$s.IndexOf(',')-1 -ge 40 -and $slen-$s.Replace(',','').length+1 -eq $numCol){
            $writer1.WriteLine($s)
        } else {
            $writer2.WriteLine($s)
        }
        if (++$currLine % 10000 -eq 0) {
            $pctDone = $reader.BaseStream.Position / $reader.BaseStream.Length
            Write-Progress 'Filtering...' -status "Line: $currLine" `
                -PercentComplete ($pctDone * 100) `
                -SecondsRemaining ($watch.ElapsedMilliseconds * (1/$pctDone - 1) / 1000)
        }
    }
} #Invoke-Command end
Write-Progress 'Filtering...' -Completed -status ' '
echo "Elapsed $($watch.Elapsed)"
$reader.close()
$writer1.close()
$writer2.close()
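The two length-based checks in the if condition can be hard to read at a glance, so here is a small sketch that applies them to three rows from the question's sample data and compares the comma count with what Split would give. The variable names ($bySplit, $byReplace, $tailLen, $verdicts) are made up for this illustration, and $numCol is hard-coded to the 8 columns of the sample header.

```powershell
# Rows copied from the sample data in the question.
$rows = @(
    'ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10',
    'ABC,,2016-06-22,,201501',
    ',,,,,,,,'
)
$numCol = 8   # assumption: matches the sample header C1..C8

$verdicts = @()
foreach ($s in $rows) {
    # Field count via Split: allocates an array of strings per line.
    $bySplit = $s.Split(',').Count
    # Field count via length difference: number of commas + 1, no array.
    $byReplace = $s.Length - $s.Replace(',', '').Length + 1
    # Number of characters after the first comma.
    $tailLen = $s.Length - $s.IndexOf(',') - 1

    $pass = ($tailLen -ge 40) -and ($byReplace -eq $numCol)
    $verdicts += $pass
    "{0,-5} fields={1}/{2} tail={3} {4}" -f $pass, $bySplit, $byReplace, $tailLen, $s
}
```

Only the first row passes: the second has 5 fields, and the all-comma row has 9 fields and a tail of just 7 characters.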
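An alternative is to use a regex in two passes (it is slower than the code above, though).
PowerShell 3 or newer is required because of the shorthand syntax for array element properties: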
$text = [IO.File]::ReadAllText('r:\data.csv')
$header = $text.substring(0, $text.indexOfAny("`r`n"))
$numCol = $header.split(',').count
$rx = [regex]"\r?\n(?:[^,]*,){$($numCol-1)}[^,]*?(?=\r?\n|$)"
[IO.File]::WriteAllText('r:\1.csv', $header + "`r`n" +
    ($rx.matches($text).groups.value -join "`r`n"))
[IO.File]::WriteAllText('r:\2.csv', $header + "`r`n" + $rx.replace($text, ''))
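To see what the pattern selects, the following sketch runs the same regex on a tiny in-memory string instead of a file. The sample text and the $good and $bad names are hypothetical. Unlike the code above, it enumerates the matches with a foreach statement and trims the leading line break from each one.

```powershell
# Hypothetical sample: 3-column header, two full rows, one short row (x,y).
$text   = "C1,C2,C3`r`na,b,c`r`nx,y`r`nd,e,f`r`n"
$header = $text.Substring(0, $text.IndexOfAny([char[]]"`r`n"))
$numCol = $header.Split(',').Count

# Same pattern as above: a line break followed by exactly $numCol fields.
$rx = [regex]"\r?\n(?:[^,]*,){$($numCol-1)}[^,]*?(?=\r?\n|$)"

# Pass 1: rows that match. Each match begins with its line break, so trim it.
$good = @(foreach ($m in $rx.Matches($text)) {
    $m.Value.TrimStart([char[]]"`r`n")
})

# Pass 2: whatever is left once the matching rows are removed.
$bad = $rx.Replace($text, '')

"Good rows: " + ($good -join ' | ')
"Remainder:"
$bad.Trim()
```

With this input the two full rows end up in $good, and the header plus the short row x,y remain in $bad. One thing to watch: [^,] also matches line-break characters, so two adjacent short rows whose fields add up to $numCol (for example x,y followed by d,e) would be matched together as a single row. Using [^,\r\n] in the character classes is one possible way to avoid that.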