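I need to split a large (500 MB) text file (a log4net exception file) into manageable chunks; something like 100 files of 5 MB each would be fine.
I would think this should be a walk in the park for PowerShell. How can I do it?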
Typ*_*rus 60
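A word of warning about some of the existing answers: they run very slowly on very large files. For a 1.6 GB log file I gave up after a couple of hours, realising it would not finish before I returned to work the next day.
Two issues: the call to Add-Content opens, seeks and then closes the current destination file for every line of the source file. Reading a little of the source file each time and looking for the newlines also slows things down, but my guess is that Add-Content is the main culprit.
The following variant produces slightly less pleasant output: it splits files in the middle of lines, but it splits my 1.6 GB log in less than a minute: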
$from = "C:\temp\large_log.txt"
$rootName = "C:\temp\large_log_chunk"
$ext = "txt"
$upperBound = 100MB   # chunk size in bytes
$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = $idx = 0
try {
    do {
        "Reading $upperBound"
        # Read returns the number of bytes actually read (0 at end of file)
        $count = $fromFile.Read($buff, 0, $buff.Length)
        if ($count -gt 0) {
            $to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext)
            # Create truncates an existing chunk file; OpenWrite would leave
            # stale bytes at the end of a longer file from an earlier run
            $toFile = [io.file]::Create($to)
            try {
                "Writing $count to $to"
                $toFile.Write($buff, 0, $count)
            } finally {
                $toFile.Close()
            }
        }
        $idx++
    } while ($count -gt 0)
}
finally {
    $fromFile.Close()
}
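If cutting in the middle of a line is not acceptable, the size-based idea can be kept while only closing a chunk at a line boundary. The following is a minimal sketch and not part of the answer above: the function name Split-FileBySize and its parameters are hypothetical, the byte count is an estimate based on the writer's encoding (UTF-8 for CreateText) and the platform newline, and it will be slower than the raw byte copy because every line is decoded. Pass full paths, since .NET resolves relative paths against the process working directory rather than the PowerShell location.

```powershell
# Hypothetical helper: size-bounded chunks that always end on a line boundary.
function Split-FileBySize {
    param(
        [Parameter(Mandatory = $true)][string]$Path,      # full path to the source file
        [Parameter(Mandatory = $true)][string]$RootName,  # full path prefix for the chunks
        [string]$Ext = "txt",
        [long]$UpperBound = 5MB                           # approximate chunk size in bytes
    )
    $reader = [System.IO.File]::OpenText($Path)
    $writer = $null
    $idx = 0
    [long]$written = 0
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($null -eq $writer) {
                # Start a new chunk lazily, so no empty trailing file is created
                $to = "{0}.{1}.{2}" -f $RootName, $idx, $Ext
                $writer = [System.IO.File]::CreateText($to)
                $written = 0
                $idx++
            }
            $writer.WriteLine($line)
            # Estimate the bytes written for this line, including the newline
            $written += $writer.Encoding.GetByteCount($line) + $writer.NewLine.Length
            if ($written -ge $UpperBound) {
                $writer.Dispose()
                $writer = $null
            }
        }
    } finally {
        if ($null -ne $writer) { $writer.Dispose() }
        $reader.Dispose()
    }
}
```

A call such as Split-FileBySize -Path C:\temp\large_log.txt -RootName C:\temp\large_log_chunk -UpperBound 5MB would produce chunks of roughly 5 MB each; a chunk can exceed the bound by at most one line.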
Lee*_*Lee 42
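This is a fairly easy task for PowerShell, complicated only by the fact that the standard Get-Content cmdlet does not handle very large files well. What I would suggest is to use the .NET StreamReader class to read the file line by line in your PowerShell script, and use the Add-Content cmdlet to write each line to a file with an ever-increasing index in the file name. Something like this: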
$upperBound = 50MB # calculated by Powershell
$ext = "log"
$rootName = "log_"
$reader = new-object System.IO.StreamReader("C:\Exceptions.log")
$count = 1
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
while(($line = $reader.ReadLine()) -ne $null)
{
    Add-Content -path $fileName -value $line
    if((Get-ChildItem -path $fileName).Length -ge $upperBound)
    {
        ++$count
        $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
    }
}
$reader.Close()
Iva*_*van 42
A simple one-liner that splits on a number of lines (100 in this case):
$i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt}
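The one-liner relies on -ReadCount making Get-Content emit arrays of lines, so each pipeline object is a whole chunk rather than a single line. A hedged sketch of the same idea as a reusable function follows; the name Split-FileByLineCount, its parameters and the zero-padded naming scheme are assumptions added for illustration. It uses Set-Content instead of Out-File because Out-File defaults to UTF-16LE in Windows PowerShell 5.1, which roughly doubles the size of ASCII logs.

```powershell
# Hypothetical wrapper around the Get-Content -ReadCount technique.
function Split-FileByLineCount {
    param(
        [Parameter(Mandatory = $true)][string]$Path,    # source file
        [Parameter(Mandatory = $true)][string]$Prefix,  # output path prefix, e.g. C:\temp\out
        [int]$LinesPerFile = 100
    )
    $i = 0
    # Each object arriving in the pipeline is an array of up to $LinesPerFile lines
    Get-Content -LiteralPath $Path -ReadCount $LinesPerFile | ForEach-Object {
        $i++
        # Zero-pad the index so the files sort in order (out_0001.txt, out_0002.txt, ...)
        $_ | Set-Content -LiteralPath ("{0}_{1:0000}.txt" -f $Prefix, $i)
    }
}
```

This keeps the simplicity of the pipeline version, so it also keeps its speed characteristics: convenient for moderate files, but slower than the StreamReader/StreamWriter approaches for multi-gigabyte input.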
Vin*_*met 32
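Same as all the answers here, but using StreamReader/StreamWriter to split on new lines (line by line, instead of trying to read the whole file into memory at once). This approach can split big files in the fastest way I know of.
Note: I do very little error checking, so I can't guarantee it will work smoothly for your case. It did for mine (a 1.7 GB TXT file of 4 million lines, split into 100,000 lines per file, in 95 seconds).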
# split test
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
$filename = "C:\Users\Vincent\Desktop\test.txt"
$rootName = "C:\Users\Vincent\Desktop\result"
$ext = "txt"           # no leading dot: the format string below already adds one
$linesperFile = 100000 # 100k
$filecount = 1
$reader = $null
$writer = $null
try{
    $reader = [io.file]::OpenText($filename)
    try{
        "Creating file number $filecount"
        $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
        $filecount++
        $linecount = 0
        while($reader.EndOfStream -ne $true) {
            "Reading $linesperFile"
            # Copy up to $linesperFile lines into the current output file
            while( ($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)){
                $writer.WriteLine($reader.ReadLine());
                $linecount++
            }
            # More input left: close this file and start the next one
            if($reader.EndOfStream -ne $true) {
                "Closing file"
                $writer.Dispose();
                "Creating file number $filecount"
                $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
                $filecount++
                $linecount = 0
            }
        }
    } finally {
        if ($writer) { $writer.Dispose() }
    }
} finally {
    if ($reader) { $reader.Dispose() }
}
$sw.Stop()
Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
Output from splitting the 1.7 GB file:
...
Creating file number 45
Reading 100000
Closing file
Creating file number 46
Reading 100000
Closing file
Creating file number 47
Reading 100000
Closing file
Creating file number 48
Reading 100000
Split complete in 95.6308289 seconds
Jos*_*osh 15
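I often need to do the same thing. The trick is getting the header repeated into each of the split chunks. I wrote the following cmdlet (PowerShell v2 CTP 3) and it does the trick.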
##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Count
# (Or -c) The maximum number of lines in each file.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv 3000 -rc 1
#
#.LINK
# Out-TempFile
##############################################################################
function Split-File {
    [CmdletBinding(DefaultParameterSetName='Path')]
    param(
        [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$Path,
        [Alias("PSPath")]
        [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$LiteralPath,
        [Alias('c')]
        [Parameter(Position=2,Mandatory=$true)]
        [Int32]$Count,
        [Alias('d')]
        [Parameter(Position=3)]
        [String]$Destination='.',
        [Alias('rc')]
        [Parameter()]
        [Int32]$RepeatCount
    )
    process {
        # yeah! the cmdlet supports wildcards
        if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
        elseif ($Path) { $ResolveArgs = @{Path=$Path} }
        Resolve-Path @ResolveArgs | %{
            $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
            $InputExt = [IO.Path]::GetExtension($_)
            if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }
            # get the input file in manageable chunks
            $Part = 1
            Get-Content $_ -ReadCount:$Count | %{
                # make an output filename with a suffix
                $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
                # In the first iteration the header will be
                # copied to the output file as usual
                # on subsequent iterations we have to do it
                if ($RepeatCount -and $Part -gt 1) {
                    Set-Content $OutputFile $Header
                }
                # write this chunk to the output file
                Write-Host "Writing $OutputFile"
                Add-Content $OutputFile $_
                $Part += 1
            }
        }
    }
}
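For very large inputs, the header-repeating idea can also be combined with the StreamReader/StreamWriter approach used in other answers. The following is a rough sketch under stated assumptions, not the cmdlet above: Split-FileWithHeader and its parameters are hypothetical, it handles a single file with no wildcard support, it expects full paths and an existing destination folder, and unlike Split-File its -Count is the number of data lines per part, not counting the repeated header.

```powershell
# Hypothetical streaming variant: repeat the first $RepeatCount lines in every part.
function Split-FileWithHeader {
    param(
        [Parameter(Mandatory = $true)][string]$Path,         # full path to the source file
        [Parameter(Mandatory = $true)][string]$Destination,  # existing output folder (full path)
        [Parameter(Mandatory = $true)][int]$Count,           # data lines per part
        [int]$RepeatCount = 1                                # number of header lines
    )
    $name = [System.IO.Path]::GetFileNameWithoutExtension($Path)
    $ext = [System.IO.Path]::GetExtension($Path)
    $reader = [System.IO.File]::OpenText($Path)
    $writer = $null
    $part = 0
    $linesInPart = 0
    try {
        # Read the header lines once so they can be replayed into every part
        $header = New-Object System.Collections.Generic.List[string]
        while (($header.Count -lt $RepeatCount) -and ($null -ne ($line = $reader.ReadLine()))) {
            $header.Add($line)
        }
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($null -eq $writer) {
                # Open the next part and write the header first
                $part++
                $out = Join-Path $Destination ('{0}-{1:0000}{2}' -f $name, $part, $ext)
                $writer = [System.IO.File]::CreateText($out)
                foreach ($h in $header) { $writer.WriteLine($h) }
                $linesInPart = 0
            }
            $writer.WriteLine($line)
            $linesInPart++
            if ($linesInPart -ge $Count) {
                $writer.Dispose()
                $writer = $null
            }
        }
    } finally {
        if ($null -ne $writer) { $writer.Dispose() }
        $reader.Dispose()
    }
}
```

Because the writer stays open for a whole part, this avoids the per-call open and close cost of Add-Content, at the price of losing the pipeline and wildcard conveniences of the cmdlet.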
use*_*448 14
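I found this question while trying to split multiple contacts in a single vCard VCF file into separate files. Here is what I did based on Lee's code. I had to look up how to create a new StreamReader object and change null to $null.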
$reader = new-object System.IO.StreamReader("C:\Contacts.vcf")
$count = 1
$filename = "C:\Contacts\{0}.vcf" -f ($count)
while(($line = $reader.ReadLine()) -ne $null)
{
    Add-Content -path $fileName -value $line
    # END:VCARD closes one contact, so the next line goes to a new file
    if($line -eq "END:VCARD")
    {
        ++$count
        $filename = "C:\Contacts\{0}.vcf" -f ($count)
    }
}
$reader.Close()
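The same record-delimiter idea can be written with a StreamWriter that stays open for the whole contact, which avoids reopening the output file for every line as Add-Content does. This is only a sketch: the function name Split-VCardFile and its parameters are hypothetical, it assumes each contact ends with a line reading END:VCARD, and it expects full paths and an existing output folder.

```powershell
# Hypothetical helper: write each vCard record to its own numbered .vcf file.
function Split-VCardFile {
    param(
        [Parameter(Mandatory = $true)][string]$Path,            # full path to the combined .vcf
        [Parameter(Mandatory = $true)][string]$OutputDirectory  # existing output folder (full path)
    )
    $reader = [System.IO.File]::OpenText($Path)
    $writer = $null
    $count = 0
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($null -eq $writer) {
                # Ignore blank lines between records instead of creating empty files
                if ($line.Trim().Length -eq 0) { continue }
                $count++
                $target = Join-Path $OutputDirectory ("{0}.vcf" -f $count)
                $writer = [System.IO.File]::CreateText($target)
            }
            $writer.WriteLine($line)
            # The record terminator closes the current file
            if ($line.Trim() -eq "END:VCARD") {
                $writer.Dispose()
                $writer = $null
            }
        }
    } finally {
        if ($null -ne $writer) { $writer.Dispose() }
        $reader.Dispose()
    }
}
```

Opening the next file only when a non-blank line arrives also avoids the empty trailing file name that the loop above prepares after the last END:VCARD.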
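Many of these answers were too slow for my source files. My source files are SQL files between 10 MB and 800 MB that needed to be split into files of roughly equal line counts.
I found some of the previous answers that use Add-Content to be quite slow. Waiting a long time for a split to finish was not uncommon.
I haven't tried Typhlosaurus's answer, but it looks like it only splits by file size, not by line count.
The following suited my purposes.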
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
Write-Host "Reading source file..."
# Note: this loads the whole file into memory
$lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
$totalLines = $lines.Length
Write-Host "Total Lines :" $totalLines
$skip = 0
$count = 100000; # Number of lines per file
# File counter, with sort friendly name
$fileNumber = 1
$fileNumberString = $filenumber.ToString("000")
# -lt rather than -le: when $skip equals $totalLines the range $skip..$upper
# would run backwards and write an extra file with bogus content
while ($skip -lt $totalLines) {
    $upper = $skip + $count - 1
    if ($upper -gt ($lines.Length - 1)) {
        $upper = $lines.Length - 1
    }
    # Write the lines
    [System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt",$lines[($skip..$upper)])
    # Increment counters
    $skip += $count
    $fileNumber++
    $fileNumberString = $filenumber.ToString("000")
}
$sw.Stop()
Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
For a 54 MB file, I get the output...
Reading source file...
Total Lines : 910030
Split complete in 1.7056578 seconds
I hope others will find a simple, line-based splitting script like this one, which fit my requirements, useful.