Splitting a CSV with PowerShell

Question

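I have large CSV files (50-500 MB each). Running complex PowerShell commands on them takes forever and/or runs into memory problems.

Processing the data requires grouping by a common field, say ColumnA. So, assuming the data is already sorted by that column, if I split these files arbitrarily (i.e. every x thousand lines), matching entries could still end up in different parts. There are thousands of distinct groups in A, so splitting each group into its own file would create far too many files.

How can I split it into files of 10,000 lines without breaking up groups? For example, rows 1-13 would be A1 in column A, rows 14-17 would be A2, and so on, and rows 9997-10012 would be A784. In that case I would want the first file to contain rows 1-10012 and the next to start at row 10013.

Obviously I want to keep the whole row (not just column A), so that if I pasted all the resulting files together, the result would be identical to the original file.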

Answer 1

mjo*_*nor 5

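Untested. This assumes ColumnA is the first column and that the data is ordinary comma-delimited. You will need to adjust the line that builds the regex to suit your data.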

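 $count = 0
 $tail = @()

 $header = Get-Content file.csv -TotalCount 1

 # -ReadCount sets the batch size, i.e. roughly the number of lines per output file
 Get-Content file.csv -ReadCount 10000 |
   ForEach-Object {
     # Add tail entries held back from the last batch to the beginning of this batch
     $newbatch = @($tail) + @($_)

     # Create a regex that matches the group key of the last entry in this batch.
     # The trailing comma keeps A1 from also matching A10, A11, ...
     $regex = '^' + [regex]::Escape($newbatch[-1].Split(',')[0]) + ','

     # Everything that doesn't match the last entry belongs to a finished group
     $complete = @($newbatch -notmatch $regex)

     # Extract tail entries to add to the next batch
     $tail = @($newbatch -match $regex)

     if ($complete.Count -gt 0) {
       # Add the header if this is not the first file
       # (the first batch still contains the header line itself)
       if ($count) { $header | Set-Content "c:\somedir\filepart_$count" }
       $complete | Add-Content "c:\somedir\filepart_$count"

       # Increment the file counter
       $count++
     }
   }

 # The last group is still held in $tail; write it to a final file
 if ($tail.Count -gt 0) {
   if ($count) { $header | Set-Content "c:\somedir\filepart_$count" }
   $tail | Add-Content "c:\somedir\filepart_$count"
 }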
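
The question asks for a slightly different boundary than the batch approach above produces: each file should run past 10,000 lines until the current group ends, rather than holding the last group back for the next file. A minimal streaming sketch of that variant might look like the following. The function name Split-CsvByGroup and its parameters are hypothetical, and it assumes PowerShell 5 or later, an output directory that already exists, input sorted by the first column, and a key field that contains no quoted delimiter.

```powershell
# Hypothetical helper: the name and parameters are illustrative assumptions.
function Split-CsvByGroup {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $OutDir,   # assumed to exist already
        [int]  $MinLines  = 10000,
        [char] $Delimiter = ','
    )

    # Resolve to full paths: .NET does not follow PowerShell's current location.
    $inFile = (Resolve-Path -LiteralPath $Path).ProviderPath
    $outDir = (Resolve-Path -LiteralPath $OutDir).ProviderPath

    $reader = [System.IO.StreamReader]::new($inFile)
    $writer = $null
    try {
        $header  = $reader.ReadLine()
        $part    = 0      # index of the next output file
        $lines   = 0      # data lines written to the current file
        $prevKey = $null  # group key of the previous data line

        while ($null -ne ($line = $reader.ReadLine())) {
            # The key is everything before the first delimiter.
            $i   = $line.IndexOf($Delimiter)
            $key = if ($i -ge 0) { $line.Substring(0, $i) } else { $line }

            # Start a new file only on a group boundary, and only once the
            # current file already holds at least $MinLines data lines.
            # -cne compares the keys case-sensitively.
            if ($null -eq $writer -or ($lines -ge $MinLines -and $key -cne $prevKey)) {
                if ($writer) { $writer.Dispose() }
                $writer = [System.IO.StreamWriter]::new((Join-Path $outDir "filepart_$part.csv"))
                $writer.WriteLine($header)   # every part gets the header
                $part++
                $lines = 0
            }

            $writer.WriteLine($line)
            $lines++
            $prevKey = $key
        }
    }
    finally {
        if ($writer) { $writer.Dispose() }
        $reader.Dispose()
    }
}
```

It could be called as `Split-CsvByGroup -Path .\file.csv -OutDir C:\somedir -MinLines 10000`. Because it reads and writes one line at a time, memory use should stay flat regardless of file size. Note that every part gets its own header, and that StreamWriter writes UTF-8 with the platform's line endings, so the concatenated parts are not guaranteed to be byte-identical to the original.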

Thanks! ("I'm stealing that!" is high praise for a scriptwriter) (2 upvotes)

Archived:	12 years, 10 months ago
Views:	17,747
Last activity:	8 years, 10 months ago