Reading a very large file in chunks instead of line by line

Yon*_*Nir 6 c# file

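I want to read a CSV file whose size can reach hundreds of GB or even several TB. I have one constraint: I can only read the file in 32 MB chunks. My current solution is not only somewhat slow, it can also split a line in the middle when the line crosses a chunk boundary.

Do you know of a better solution?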

const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;

using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    bool stop = false;
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0) //reading only 32mb chunks at a time
    {
        var stream = new StreamReader(new MemoryStream(buffer));
        while ((line = stream.ReadLine()) != null)
        {
            //process line
        }

    }
}

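Please do not answer with solutions that read the file line by line (for example, File.ReadLines is NOT an acceptable solution). Why? Because I am simply looking for a different approach...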

Bar*_*zKP 5

The problem with your solution is that you recreate the streams on every iteration. Try this version:

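using System.IO;
using System.Text;

const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;
StringBuilder currentLine = new StringBuilder();

using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    var memoryStream = new MemoryStream(buffer);
    var stream = new StreamReader(memoryStream);
    while ((bytesRead = FillBuffer(bs, buffer)) != 0)
    {
        //expose only the bytes of this chunk (this only ever shrinks the stream,
        //because FillBuffer returns a short chunk only at the end of the file)
        memoryStream.SetLength(bytesRead);
        memoryStream.Seek(0, SeekOrigin.Begin);
        //the reader must forget what it buffered before the seek; this also resets
        //its decoder, so a multi-byte character split across two chunks is not handled
        stream.DiscardBufferedData();

        while (!stream.EndOfStream)
        {
            line = ReadLineWithAccumulation(stream, currentLine);

            if (line != null)
            {
                //process line
            }
        }
    }

    if (currentLine.Length > 0)
    {
        //process currentLine.ToString(): the last line had no trailing newline
    }
}

//reads until the buffer is full or the file ends, so only the last chunk can be short
private static int FillBuffer(Stream source, byte[] buffer)
{
    int total = 0, read;
    while (total < buffer.Length && (read = source.Read(buffer, total, buffer.Length - total)) > 0)
    {
        total += read;
    }

    return total;
}

private string ReadLineWithAccumulation(StreamReader stream, StringBuilder currentLine)
{
    while (stream.Read(charBuffer, 0, 1) > 0)
    {
        if (charBuffer[0] == '\n')
        {
            //drop the '\r' of a "\r\n" pair; does nothing if newlines are a single '\n'
            if (currentLine.Length > 0 && currentLine[currentLine.Length - 1] == '\r')
            {
                currentLine.Length--;
            }

            string result = currentLine.ToString();
            currentLine.Clear();
            return result;
        }
        else
        {
            currentLine.Append(charBuffer[0]);
        }
    }

    return null;  //line not complete yet
}

private char[] charBuffer = new char[1];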

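Note: some adjustments are needed if your newlines are two characters long and you need the newline characters included in the result. The worst case is a "\r\n" pair that gets split across two chunks. However, since you were using ReadLine, I assumed you don't need this.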
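If you do need the terminators kept, one possible adjustment is to split on the raw bytes instead of on decoded characters. The following is only a sketch under assumptions, not part of the code above: the class name ChunkLineSplitter and its Feed/Flush methods are made up for illustration, and it assumes the file is UTF-8 (where the byte 0x0A can only ever be a '\n'). Because an unfinished line is carried over as bytes, a '\r' at the end of one chunk simply waits for its '\n' in the next one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: splits a sequence of byte chunks into lines, keeping "\n" or "\r\n".
public sealed class ChunkLineSplitter
{
    // Bytes of a line that has started but whose '\n' has not been seen yet.
    private readonly MemoryStream _pending = new MemoryStream();

    // Feeds one chunk and returns every line it completes, terminator included.
    public List<string> Feed(byte[] chunk, int count)
    {
        var lines = new List<string>();
        int start = 0;
        for (int i = 0; i < count; i++)
        {
            if (chunk[i] != (byte)'\n') continue;

            // Append the rest of the line (including '\n') to the carried-over bytes.
            _pending.Write(chunk, start, i - start + 1);
            lines.Add(Encoding.UTF8.GetString(_pending.GetBuffer(), 0, (int)_pending.Length));
            _pending.SetLength(0);
            start = i + 1;
        }

        // Whatever follows the last '\n' is an unfinished line: keep it for the next chunk.
        _pending.Write(chunk, start, count - start);
        return lines;
    }

    // Returns the final line if the data did not end with '\n', otherwise null.
    public string Flush()
    {
        if (_pending.Length == 0) return null;
        string rest = Encoding.UTF8.GetString(_pending.GetBuffer(), 0, (int)_pending.Length);
        _pending.SetLength(0);
        return rest;
    }
}
```

You would call Feed(buffer, bytesRead) once per 32 MB chunk and Flush() after the last one. For example, feeding the bytes of "id,name\r" and then "\n1,Ann\r\n2,Bo" returns nothing from the first call, the two lines "id,name\r\n" and "1,Ann\r\n" from the second, and "2,Bo" from Flush(). Decoding only complete lines also sidesteps multi-byte characters being cut at a chunk boundary, at the cost of one extra copy per line.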

Also, be aware that if your entire data consists of a single line, this will end up trying to read all of it into memory.
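If that is a realistic risk for your files, you could cap how much a single line may accumulate and fail early instead of running out of memory. This is a rough sketch only: BoundedLineBuffer and its maxChars parameter are hypothetical names, and the right limit depends entirely on your data. It stands in for the plain StringBuilder used for currentLine.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical replacement for the bare StringBuilder: refuses to grow past a limit.
public sealed class BoundedLineBuffer
{
    private readonly StringBuilder _line = new StringBuilder();
    private readonly int _maxChars;

    public BoundedLineBuffer(int maxChars)
    {
        _maxChars = maxChars;
    }

    // Number of characters accumulated for the current, unfinished line.
    public int Length => _line.Length;

    // Adds one character, or throws once the unfinished line reaches the limit.
    public void Append(char c)
    {
        if (_line.Length >= _maxChars)
        {
            throw new InvalidDataException(
                "Line exceeds " + _maxChars + " characters; the input may not be line-delimited.");
        }

        _line.Append(c);
    }

    // Returns the accumulated line and resets the buffer for the next one.
    public string TakeLine()
    {
        string result = _line.ToString();
        _line.Clear();
        return result;
    }
}
```

This does not make a one-line file processable; it only turns a slow out-of-memory failure into an immediate, descriptive exception.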