I want to read a CSV file which could be as big as hundreds of GB or even a TB. I have a constraint that I can only read the file in 32MB chunks. My solution to this problem not only works a bit slowly, but it can also break a line in the middle.
I wanted to ask if you know of a better solution:
const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;
using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0) //reading only 32MB chunks at a time
    {
        var stream = new StreamReader(new MemoryStream(buffer, 0, bytesRead));
        while ((line = stream.ReadLine()) != null)
        {
            //process line
        }
    }
}
Please don't answer with a solution that reads the file line by line (for example, File.ReadLines is not an acceptable solution). Why? Because I'm simply looking for a different kind of solution...
The problem with your solution is that you recreate the streams in every iteration. Try this version:
const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;
StringBuilder currentLine = new StringBuilder();
using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    var memoryStream = new MemoryStream(buffer);
    var stream = new StreamReader(memoryStream);
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0)
    {
        memoryStream.SetLength(bytesRead);  //only the bytes just read are valid
        memoryStream.Seek(0, SeekOrigin.Begin);
        stream.DiscardBufferedData();       //drop characters cached from the previous chunk
        while (!stream.EndOfStream)
        {
            line = ReadLineWithAccumulation(stream, currentLine);
            if (line != null)
            {
                //process line
            }
        }
    }
}
private string ReadLineWithAccumulation(StreamReader stream, StringBuilder currentLine)
{
    while (stream.Read(charBuffer, 0, 1) > 0)
    {
        if (charBuffer[0] == '\n')
        {
            string result = currentLine.ToString();
            currentLine.Clear();
            if (result.Length > 0 && result[result.Length - 1] == '\r') //strip the '\r' of a two-character "\r\n" newline
            {
                result = result.Substring(0, result.Length - 1);
            }
            return result;
        }
        else
        {
            currentLine.Append(charBuffer[0]);
        }
    }
    return null; //line not complete yet
}

private char[] charBuffer = new char[1];
Note: some tweaking is needed if your newlines are two characters long and you need them included in the result. The worst case is the newline pair "\r\n" being split across two chunks. However, since you were using ReadLine, I assumed you don't need this.
Also, be aware that if your entire data consists of a single line, this will still end up trying to read the whole data into memory.
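To make the chunk-boundary behavior concrete, here is a small self-contained sketch of the same accumulation idea. The class name, the test input, and the tiny 4-byte chunk size are all illustrative (chosen so a "\r\n" pair actually gets split across a boundary); they are not part of the answer's code, which uses a 32MB buffer over a real file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class ChunkedLineDemo
{
    static readonly char[] charBuffer = new char[1];
    static readonly StringBuilder currentLine = new StringBuilder();

    // Same idea as ReadLineWithAccumulation above: returns a completed line,
    // or null when the current chunk ends mid-line (the partial line stays
    // in currentLine and is continued from the next chunk).
    static string ReadLineWithAccumulation(StreamReader reader)
    {
        while (reader.Read(charBuffer, 0, 1) > 0)
        {
            if (charBuffer[0] == '\n')
            {
                string result = currentLine.ToString();
                currentLine.Clear();
                if (result.Length > 0 && result[result.Length - 1] == '\r')
                    result = result.Substring(0, result.Length - 1);
                return result;
            }
            currentLine.Append(charBuffer[0]);
        }
        return null; // line continues in the next chunk
    }

    public static string Run()
    {
        byte[] data = Encoding.ASCII.GetBytes("aaa\r\nbbbbbb\r\ncc\r\n");
        const int CHUNK = 4; // tiny chunk size, so "\r\n" gets split across boundaries
        var lines = new List<string>();

        using (var source = new MemoryStream(data)) // stand-in for the file
        {
            byte[] buffer = new byte[CHUNK];
            int bytesRead;
            while ((bytesRead = source.Read(buffer, 0, CHUNK)) != 0)
            {
                using (var reader = new StreamReader(new MemoryStream(buffer, 0, bytesRead)))
                {
                    string line;
                    while ((line = ReadLineWithAccumulation(reader)) != null)
                        lines.Add(line);
                }
            }
        }
        return string.Join("|", lines);
    }

    static void Main()
    {
        Console.WriteLine(Run()); // aaa|bbbbbb|cc
    }
}
```

Even though every "\r\n" here straddles or abuts a chunk boundary, the lines come out whole, because the partial line accumulates in the StringBuilder across chunks and the trailing '\r' is only stripped once the matching '\n' arrives.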