How can I split a text file using PowerShell?

Question

Ral*_*ton 68 powershell

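I need to split a large (500 MB) text file (a log4net exception file) into manageable chunks; something like 100 files of 5 MB each would be fine.

I would think this should be a walk in the park for PowerShell. How can I do it?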

Answer 1

Typ*_*rus 60

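A word of warning about some of the existing answers: they run very slowly on very large files. For a 1.6 GB log file I gave up after a couple of hours, realising it would not finish before I returned to work the next day.

There are two issues. The call to Add-Content opens, seeks to the end of, and then closes the current destination file for every single line in the source file. Reading a little of the source file at a time and looking for new lines also slows things down, but my guess is that Add-Content is the main culprit.

The following variant produces slightly less pleasant output: it will split files in the middle of lines, but it splits my 1.6 GB log in less than a minute: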

$from = "C:\temp\large_log.txt"
$rootName = "C:\temp\large_log_chunk"
$ext = "txt"
$upperBound = 100MB


$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = $idx = 0
try {
    do {
        "Reading $upperBound"
        $count = $fromFile.Read($buff, 0, $buff.Length)
        if ($count -gt 0) {
            $to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext)
            # Create (rather than OpenWrite) truncates a chunk left over from an earlier run
            $toFile = [io.file]::Create($to)
            try {
                "Writing $count to $to"
                $tofile.Write($buff, 0, $count)
            } finally {
                $tofile.Close()
            }
        }
        $idx ++
    } while ($count -gt 0)
}
finally {
    $fromFile.Close()
}

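It took me a while to figure out how this script works. I put up a gist of it, in case anyone is interested: https://gist.github.com/awayken/5861923 (8 upvotes)
This approach worked well for me on a 6 GB file that I needed to split in an emergency so that the smaller chunks could be analysed more efficiently. Thanks for posting! (3 upvotes)
Is there a reason you didn't use `StreamReader`? That way you could split on new lines. (2 upvotes)
If you add these lines to the beginning of the script to define the variables, and modify them to suit the file you are trying to split, you will be all set! $from = "C:\temp\large_log.txt" $rootName = "C:\temp\large_log_chunk" $ext = "txt" (2 upvotes)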
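
A minimal sketch of a size-based split that still ends every chunk on a line boundary, which is what the `StreamReader` comment above is asking about. It is not the answer author's script: `Split-FileAtNewline` and its parameter names are hypothetical, and it assumes an encoding in which byte 10 only ever means a line feed (ASCII or UTF-8, not UTF-16). A chunk may exceed the upper bound by up to one line.

```powershell
# Hypothetical helper: block-wise copy, then finish the current line byte by byte.
function Split-FileAtNewline {
    param(
        [string]$From,             # source file (use a full path; .NET ignores the PowerShell location)
        [string]$RootName,         # output prefix, e.g. C:\temp\large_log_chunk
        [string]$Ext = 'txt',
        [int]$UpperBound = 100MB   # block size in bytes
    )

    $fromFile = [IO.File]::OpenRead($From)
    $buff = New-Object byte[] $UpperBound
    $idx = 0
    try {
        while (($count = $fromFile.Read($buff, 0, $buff.Length)) -gt 0) {
            $to = '{0}.{1}.{2}' -f $RootName, $idx, $Ext
            $toFile = [IO.File]::Create($to)
            try {
                $toFile.Write($buff, 0, $count)
                # If the block did not end on a line feed, copy the rest of the line.
                if ($buff[$count - 1] -ne 10) {
                    while (($b = $fromFile.ReadByte()) -ne -1) {
                        $toFile.WriteByte([byte]$b)
                        if ($b -eq 10) { break }
                    }
                }
            } finally {
                $toFile.Close()
            }
            $idx++
        }
    } finally {
        $fromFile.Close()
    }
}
```

It could be called as `Split-FileAtNewline -From C:\temp\large_log.txt -RootName C:\temp\large_log_chunk -UpperBound 100MB`. The bulk of the data still moves in large blocks, so the extra per-byte loop only runs for the tail of one line per chunk.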

Answer 2

Lee*_*Lee 42

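This is a fairly easy task for PowerShell, complicated by the fact that the standard Get-Content cmdlet does not handle very large files well. What I would suggest is to use the .NET StreamReader class to read the file line by line in your PowerShell script, and use the Add-Content cmdlet to write each line to a file with an ever-increasing index in the file name. Something like this: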

$upperBound = 50MB # calculated by Powershell
$ext = "log"
$rootName = "log_"

$reader = new-object System.IO.StreamReader("C:\Exceptions.log")
$count = 1
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
while(($line = $reader.ReadLine()) -ne $null)
{
    Add-Content -path $fileName -value $line
    if((Get-ChildItem -path $fileName).Length -ge $upperBound)
    {
        ++$count
        $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
    }
}

$reader.Close()

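Useful tip: you can express numbers like this... $upperBound = 5MB (3 upvotes)
For those too lazy to read the next answer, you can set up the $reader object with $reader = new-object System.IO.StreamReader($inputFile) (3 upvotes)
I would suggest using a StringBuilder to concatenate the lines before calling Add-Content to write them out; otherwise this approach is very slow. (2 upvotes)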
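
The StringBuilder suggestion in the last comment could look roughly like the sketch below: lines are collected in memory and Add-Content is called once per output file instead of once per line. `Split-FileBuffered` and its parameters are hypothetical names, the size limit is counted in characters rather than bytes, and `-NoNewline` needs PowerShell 5.0 or later.

```powershell
# Hypothetical helper: buffer lines in a StringBuilder, write one chunk per Add-Content call.
function Split-FileBuffered {
    param(
        [string]$Path,             # source file (full path)
        [string]$RootName,         # output prefix, e.g. C:\logs\log_
        [string]$Ext = 'log',
        [int]$UpperBound = 50MB    # approximate chunk size, in characters
    )

    $reader = New-Object System.IO.StreamReader($Path)
    $sb = New-Object System.Text.StringBuilder
    $count = 1
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            [void]$sb.AppendLine($line)
            if ($sb.Length -ge $UpperBound) {
                $fileName = '{0}{1}.{2}' -f $RootName, $count, $Ext
                # The buffer already ends with a line break, so suppress the extra one.
                Add-Content -Path $fileName -Value $sb.ToString() -NoNewline
                [void]$sb.Clear()
                $count++
            }
        }
        # Flush whatever is left after the last full chunk.
        if ($sb.Length -gt 0) {
            $fileName = '{0}{1}.{2}' -f $RootName, $count, $Ext
            Add-Content -Path $fileName -Value $sb.ToString() -NoNewline
        }
    } finally {
        $reader.Close()
    }
}
```

The trade-off is memory: one whole chunk is held as a string before it is written, so the chunk size should stay well below the available RAM.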

Answer 3

Iva*_*van 42

A simple one-liner that splits based on the number of lines (100 in this case):

$i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt}

It is worth noting that the default output encoding seems to be UTF16LE. If you don't want that, add the encoding type: `Out-File out_$i.txt -Encoding UTF8}` (3 upvotes)
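
Putting the one-liner and the encoding comment together, a small sketch might look like this. `Split-ByLineCount` and its parameter names are hypothetical. In Windows PowerShell 5.1, Out-File defaults to UTF-16LE and `-Encoding UTF8` writes a byte order mark; PowerShell 6 and later default to UTF-8 without one.

```powershell
# Hypothetical wrapper around the one-liner, with an explicit output encoding.
function Split-ByLineCount {
    param(
        [string]$Path,              # source file
        [int]$LinesPerFile = 100,   # lines per output file
        [string]$OutPrefix = 'out_' # output files become <prefix>1.txt, <prefix>2.txt, ...
    )

    $i = 0
    # -ReadCount makes Get-Content emit arrays of up to $LinesPerFile lines at a time.
    Get-Content -Path $Path -ReadCount $LinesPerFile | ForEach-Object {
        $i++
        $_ | Out-File -FilePath "$OutPrefix$i.txt" -Encoding UTF8
    }
}
```

For example, `Split-ByLineCount -Path .\app.log -LinesPerFile 100` (with `app.log` standing in for your own file) writes `out_1.txt`, `out_2.txt` and so on into the current location.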

Answer 4

Vin*_*met 32

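Same as all the answers here, but using StreamReader/StreamWriter to split on new lines (line by line, instead of trying to read the whole file into memory at once). This approach splits big files in the fastest way I know of.

Note: I do very little error checking, so I can't guarantee it will work smoothly for your case. It did for mine (a 1.7 GB TXT file with 4 million lines, split into files of 100,000 lines each, in 95 seconds).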

#split test
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
$filename = "C:\Users\Vincent\Desktop\test.txt"
$rootName = "C:\Users\Vincent\Desktop\result"
$ext = "txt"   # no leading dot: the format string below already inserts one

$linesperFile = 100000#100k
$filecount = 1
$reader = $null
try{
    $reader = [io.file]::OpenText($filename)
    try{
        "Creating file number $filecount"
        $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
        $filecount++
        $linecount = 0

        while($reader.EndOfStream -ne $true) {
            "Reading $linesperFile"
            while( ($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)){
                $writer.WriteLine($reader.ReadLine());
                $linecount++
            }

            if($reader.EndOfStream -ne $true) {
                "Closing file"
                $writer.Dispose();

                "Creating file number $filecount"
                $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
                $filecount++
                $linecount = 0
            }
        }
    } finally {
        $writer.Dispose();
    }
} finally {
    $reader.Dispose();
}
$sw.Stop()

Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

Output from splitting the 1.7 GB file:

...
Creating file number 45
Reading 100000
Closing file
Creating file number 46
Reading 100000
Closing file
Creating file number 47
Reading 100000
Closing file
Creating file number 48
Reading 100000
Split complete in  95.6308289 seconds

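For anyone who wants to use the solution above and also needs a repeated header, the one extra step is to add $writer.WriteLine($header) right after the "Reading $linesperFile" line. $header is a variable, declared in the initial part of the code, that holds all the required columns. Thanks @Vincent for the quick solution. (4 upvotes)
The best solution so far! It is fast and preserves the original encoding. The other solutions above read and rewrite the content, and they all broke the language encoding. Amazing. Thanks a lot! (2 upvotes)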
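
A minimal sketch of the repeated-header idea from the first comment, restructured as a function rather than patched into the script above. `Split-FileWithHeader` and its parameters are hypothetical names, and the sketch assumes the first line of the source file is the header to repeat. `OpenText` reads UTF-8 and `CreateText` writes UTF-8, so other encodings would need explicit StreamReader/StreamWriter constructors.

```powershell
# Hypothetical helper: line-based split that repeats the first line in every output file.
function Split-FileWithHeader {
    param(
        [string]$Path,                 # source file (full path)
        [string]$RootName,             # output prefix, e.g. C:\data\result
        [string]$Ext = 'txt',
        [int]$LinesPerFile = 100000    # data lines per file, header not counted
    )

    $reader = [IO.File]::OpenText($Path)
    $writer = $null
    $fileCount = 0
    try {
        $header = $reader.ReadLine()
        $lineCount = $LinesPerFile     # forces a new file before the first data line
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($lineCount -ge $LinesPerFile) {
                if ($writer) { $writer.Dispose() }
                $fileCount++
                $name = '{0}{1}.{2}' -f $RootName, $fileCount.ToString('000'), $Ext
                $writer = [IO.File]::CreateText($name)
                $writer.WriteLine($header)
                $lineCount = 0
            }
            $writer.WriteLine($line)
            $lineCount++
        }
    } finally {
        if ($writer) { $writer.Dispose() }
        $reader.Dispose()
    }
}
```

Opening each output file lazily, only when there is a data line to write, means a source that ends exactly on a chunk boundary does not leave behind a trailing file that contains nothing but the header.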

Answer 5

Jos*_*osh 15

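I often need to do the same thing. The trick is getting the header repeated into each of the split chunks. I wrote the following cmdlet (PowerShell v2 CTP 3) and it does the trick.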

##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Count
# (Or -c) The maximum number of lines in each file.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv 3000 -rc 1
#
#.LINK 
# Out-TempFile
##############################################################################
function Split-File {

    [CmdletBinding(DefaultParameterSetName='Path')]
    param(

        [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$Path,

        [Alias("PSPath")]
        [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$LiteralPath,

        [Alias('c')]
        [Parameter(Position=2,Mandatory=$true)]
        [Int32]$Count,

        [Alias('d')]
        [Parameter(Position=3)]
        [String]$Destination='.',

        [Alias('rc')]
        [Parameter()]
        [Int32]$RepeatCount

    )

    process {

        # yeah! the cmdlet supports wildcards
        if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
        elseif ($Path) { $ResolveArgs = @{Path=$Path} }

        Resolve-Path @ResolveArgs | %{

            $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
            $InputExt  = [IO.Path]::GetExtension($_)

            if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

            # get the input file in manageable chunks

            $Part = 1
            Get-Content $_ -ReadCount:$Count | %{

                # make an output filename with a suffix
                $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

                # In the first iteration the header will be
                # copied to the output file as usual
                # on subsequent iterations we have to do it
                if ($RepeatCount -and $Part -gt 1) {
                    Set-Content $OutputFile $Header
                }

                # write this chunk to the output file
                Write-Host "Writing $OutputFile"
                Add-Content $OutputFile $_

                $Part += 1

            }

        }

    }

}

Answer 6

use*_*448 14

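I found this question while trying to split multiple contacts in a single vCard VCF file into separate files. Here is what I did, based on Lee's code. I had to look up how to create a new StreamReader object and change null to $null.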

$reader = new-object System.IO.StreamReader("C:\Contacts.vcf")
$count = 1
$filename = "C:\Contacts\{0}.vcf" -f ($count) 

while(($line = $reader.ReadLine()) -ne $null)
{
    Add-Content -path $fileName -value $line

    if($line -eq "END:VCARD")
    {
        ++$count
        $filename = "C:\Contacts\{0}.vcf" -f ($count)
    }
}

$reader.Close()

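Since other answers on this page report that calling Add-Content once per line is slow, the same delimiter-based split can also be sketched with a StreamWriter that stays open until the end of each card. `Split-VCard` and its parameters are hypothetical names, and the sketch assumes every contact ends with a line that is exactly `END:VCARD`.

```powershell
# Hypothetical helper: one output file per vCard, switching files at each END:VCARD line.
function Split-VCard {
    param(
        [string]$Path,    # source .vcf file (full path)
        [string]$OutDir   # existing directory for the per-contact files
    )

    $reader = New-Object System.IO.StreamReader($Path)
    $writer = $null
    $count = 0
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            # Open the next numbered file only when there is a line to put in it.
            if ($null -eq $writer) {
                $count++
                $writer = [IO.File]::CreateText((Join-Path $OutDir ('{0}.vcf' -f $count)))
            }
            $writer.WriteLine($line)
            if ($line -eq 'END:VCARD') {
                $writer.Dispose()
                $writer = $null
            }
        }
    } finally {
        if ($writer) { $writer.Dispose() }
        $reader.Close()
    }
}
```

A call such as `Split-VCard -Path C:\Contacts.vcf -OutDir C:\Contacts` would produce `1.vcf`, `2.vcf` and so on.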

Answer 7

CVe*_*tex 6

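Many of these answers were too slow for my source files. My source files are SQL files between 10 MB and 800 MB that needed to be split into files with roughly equal line counts.

I found some of the previous answers that use Add-Content to be quite slow. Waiting a long time for a split to finish was not uncommon.

I haven't tried Typhlosaurus's answer, but it looks like it only splits by file size, not by line count.

The following suited my purposes.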

$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
Write-Host "Reading source file..."
$lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
$totalLines = $lines.Length

Write-Host "Total Lines :" $totalLines

$skip = 0
$count = 100000; # Number of lines per file

# File counter, with sort friendly name
$fileNumber = 1
$fileNumberString = $filenumber.ToString("000")

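# -lt rather than -le: avoids a spurious extra file when the line count is an exact multiple of $count
while ($skip -lt $totalLines) {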
    $upper = $skip + $count - 1
    if ($upper -gt ($lines.Length - 1)) {
        $upper = $lines.Length - 1
    }

    # Write the lines
    [System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt",$lines[($skip..$upper)])

    # Increment counters
    $skip += $count
    $fileNumber++
    $fileNumberString = $filenumber.ToString("000")
}

$sw.Stop()

Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

For a 54 MB file, I get the output...

Reading source file...
Total Lines : 910030
Split complete in  1.7056578 seconds

I hope others looking for a simple, line-based splitting script that matches my requirements will find this useful.

Archived: 16 years, 2 months ago
Views: 116077
Last activity: 7 years, 4 months ago