Reading a large file that contains binary-format files, and extracting those files with minimum heap allocation

Den*_*gin 8 .net c# filestream binary-data large-files

Sorry, the title may be a bit confusing, but I don't know how to explain it better.

There are two files, with the extensions .cat (catalog file) and .dat. The .cat file contains information about the binary files packed inside the .dat file: each entry holds a file's name, its size, its offset within the .dat file, and its MD5 hash.

Example .cat file:

assets/textures/environments/asteroids/ast_crystal_blue_diff-small.gz 22387 1546955265 85a67a982194e4141e08fac4bf062c8f
assets/textures/environments/asteroids/ast_crystal_blue_diff.gz 83859 1546955265 86c7e940de82c2c2573a822c9efc9b6b
assets/textures/environments/asteroids/ast_crystal_diff-small.gz 22693 1546955265 cff6956c94b59e946b78419d9c90f972
assets/textures/environments/asteroids/ast_crystal_diff.gz 85531 1546955265 57d5a24dd4da673a42cbf0a3e8e08398
assets/textures/environments/asteroids/ast_crystal_green_diff-small.gz 22312 1546955265 857fea639e1af42282b015e8decb02db
assets/textures/environments/asteroids/ast_crystal_green_diff.gz 115569 1546955265 ee6f60b0a8211ec048172caa762d8a1a
assets/textures/environments/asteroids/ast_crystal_purple_diff-small.gz 14179 1546955265 632317951273252d516d36b80de7dfcd
assets/textures/environments/asteroids/ast_crystal_purple_diff.gz 53781 1546955265 c057acc06a4953ce6ea3c6588bbad743
assets/textures/environments/asteroids/ast_crystal_yellow_diff-small.gz 21966 1546955265 a893c12e696f9e5fb188409630b8d10b
assets/textures/environments/asteroids/ast_crystal_yellow_diff.gz 82471 1546955265 c50a5e59093fe9c6abb64f0f47a26e57
assets/textures/environments/asteroids/xen_crystal_diff-small.gz 14161 1546955265 23b34bdd1900a7e61a94751ae798e934
assets/textures/environments/asteroids/xen_crystal_diff.gz 53748 1546955265 dcb7c8294ef72137e7bca8dd8ea2525f
assets/textures/lensflares/lens_rays3_small_diff.gz 14107 1546955265 a656d1fad4198b0662a783919feb91a5

I parsed these files relatively easily, and after some benchmarking with BenchmarkDotNet, I believe that by using Span&lt;T&gt; I have optimized reading this type of file as much as possible.
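The question does not include the .cat parser itself, but given the line format above, a Span-based parse of one catalog line might look like the following. This is an editor's sketch, not the author's code; the `CatalogLineParser` name and tuple shape are illustrative. Slicing the span avoids allocating intermediate substrings for the numeric fields:

```csharp
using System;

static class CatalogLineParser
{
    // Parses one line of the form "<path> <size> <offset> <md5>".
    // int.Parse/long.Parse accept ReadOnlySpan<char> directly (since .NET Core 2.1),
    // so no substring is allocated for the numeric fields.
    public static (string Path, int Size, long Offset, string Md5) Parse(ReadOnlySpan<char> line)
    {
        int i = line.IndexOf(' ');
        string path = line[..i].ToString();
        line = line[(i + 1)..];

        i = line.IndexOf(' ');
        int size = int.Parse(line[..i]);
        line = line[(i + 1)..];

        i = line.IndexOf(' ');
        long offset = long.Parse(line[..i]);
        string md5 = line[(i + 1)..].ToString();

        return (path, size, offset, md5);
    }
}
```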

But the .dat files are another story. A typical .dat file is gigabytes in size.

I first tried the most straightforward approach I could think of: for each catalog entry, seek to the entry's offset, allocate a `new byte[]` of the asset's size, read the asset into it, and write it out to disk.

(I removed the null checks and validation code to make the code more readable.)


As you can guess, this approach is slow, allocates a lot of memory on the heap, and keeps the GC busy.

I made some modifications to the method above and tried reading through a buffer, then using stackalloc and Span instead of `new byte[catalogEntry.AssetSize]`. I did not gain much from buffered reading, and with stackalloc I naturally got a StackOverflowException, because some of the files are larger than the stack size.
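A common middle ground for the stackalloc problem described above (an editor's sketch, not from the question; the method name and threshold value are illustrative) is to stackalloc only below a fixed threshold and fall back to a pooled array for anything larger:

```csharp
using System;
using System.Buffers;

static class ScratchBuffer
{
    private const int StackAllocThreshold = 256;

    // Copies `data` through a scratch buffer and returns the byte sum.
    // Small inputs live on the stack; large inputs borrow from ArrayPool,
    // so neither StackOverflowException nor per-call heap allocation occurs.
    public static long SumViaScratch(ReadOnlySpan<byte> data)
    {
        byte[] rented = null;
        Span<byte> scratch = data.Length <= StackAllocThreshold
            ? stackalloc byte[StackAllocThreshold]
            : (rented = ArrayPool<byte>.Shared.Rent(data.Length));
        try
        {
            data.CopyTo(scratch);
            long sum = 0;
            foreach (byte b in scratch[..data.Length])
            {
                sum += b;
            }
            return sum;
        }
        finally
        {
            if (rented != null)
            {
                ArrayPool<byte>.Shared.Return(rented);
            }
        }
    }
}
```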

Then, after some research, I decided I could use System.IO.Pipelines, introduced with .NET Core 2.1. I changed the method above as follows.

public async Task ExportAssetsAsync(CatalogFile catalogFile, string destDirectory, CancellationToken ct = default)
{
    IFileInfo catalogFileInfo = _fs.FileInfo.FromFileName(catalogFile.FilePath);
    string catalogFileName = _fs.Path.GetFileNameWithoutExtension(catalogFileInfo.Name);
    string datFilePath = _fs.Path.Combine(catalogFileInfo.DirectoryName, $"{catalogFileName}.dat");
    IFileInfo datFileInfo = _fs.FileInfo.FromFileName(datFilePath);

    await using Stream stream = datFileInfo.OpenRead();
    
    foreach (CatalogEntry catalogEntry in catalogFile.CatalogEntries)
    {
        string destFilePath = _fs.Path.Combine(destDirectory, catalogEntry.AssetPath);
        IFileInfo destFile = _fs.FileInfo.FromFileName(destFilePath);
        if (!destFile.Directory.Exists)
        {
            destFile.Directory.Create();
        }
        stream.Seek(catalogEntry.ByteOffset, SeekOrigin.Begin);
        var newFileData = new byte[catalogEntry.AssetSize];
        int read = await stream.ReadAsync(newFileData, 0, catalogEntry.AssetSize, ct);
        if (read != catalogEntry.AssetSize)
        {
            _logger?.LogError("Could not read asset data from dat file: {DatFile}", datFilePath);
            throw new DatFileReadException("Could not read asset data from dat file", datFilePath);
        }
        await using Stream destStream = _fs.File.Open(destFile.FullName, FileMode.Create);
        destStream.Write(newFileData);
        destStream.Close();
    }
}

According to BenchmarkDotNet, the results were worse than the first approach in both performance and memory allocation. This is probably because I am using System.IO.Pipelines incorrectly, or for a purpose it was not intended for.
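For reference, since a working Pipelines variant is not shown here: a typical consumption loop with `PipeReader` looks like the following. This is an editor's sketch of the general pattern, not the author's attempt; it requires the System.IO.Pipelines package, and the `CopyAllAsync` name is illustrative:

```csharp
using System;
using System.IO;
using System.IO.Pipelines;
using System.Threading.Tasks;

static class PipeReaderSketch
{
    // Drains `source` through a PipeReader, writing everything to `destination`.
    // The pipe owns and recycles the intermediate buffers internally.
    public static async Task CopyAllAsync(Stream source, Stream destination)
    {
        PipeReader reader = PipeReader.Create(source);
        while (true)
        {
            ReadResult result = await reader.ReadAsync();
            foreach (ReadOnlyMemory<byte> segment in result.Buffer)
            {
                destination.Write(segment.Span);
            }
            // Tell the pipe we consumed everything we examined.
            reader.AdvanceTo(result.Buffer.End);
            if (result.IsCompleted)
            {
                break;
            }
        }
        reader.Complete();
    }
}
```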

I do not have much experience with this, as I have never done I/O operations on files this large before. How can I do what I want with minimum memory allocation and maximum performance? Thank you very much in advance for your help and guidance.

Den*_*gin 7

First of all, I would like to thank Mauricio Atanache and Alexei Levenkov for their suggestions. I learned a lot while trying out both of the methods they suggested. After running benchmarks, I decided to go with the SubStream and Stream.CopyTo approach that Alexei Levenkov suggested.

First, I want to share the solution. Afterwards, those who are curious can check the benchmarks and the results.

Solution

Alexei pointed me to an old question, and I reviewed the solution there and adapted it to my own code:

How to expose a sub-section of my stream to a user

First, I needed a SubStream implementation. Basically, what I want to do is extract small files from the large .dat file. With a SubStream, I can encapsulate the section of the FileStream starting at the offset I want. Then, using Stream.CopyTo, I can copy the contents of the SubStream into another FileStream and write it to the file system. With this approach, only a single buffer allocation is made.
public class SubStream : Stream
{
    private readonly Stream _baseStream;
    private readonly long _length;
    private long _position;

    public SubStream(Stream baseStream, long offset, long length)
    {
        if (baseStream == null)
        {
            throw new ArgumentNullException(nameof(baseStream), "Base stream cannot be null");
        }

        if (!baseStream.CanRead)
        {
            throw new ArgumentException("Base stream must be readable.", nameof(baseStream));
        }

        if (offset < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(offset));
        }

        _baseStream = baseStream;
        _length = length;

        if (baseStream.CanSeek)
        {
            baseStream.Seek(offset, SeekOrigin.Current);
        }
        else
        {
            // read it manually...
            const int bufferSize = 512;
            var buffer = new byte[bufferSize];
            while (offset > 0)
            {
                int read = baseStream.Read(buffer, 0, offset < bufferSize ? (int)offset : bufferSize);
                offset -= read;
            }
        }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        CheckDisposed();
        long remaining = _length - _position;
        if (remaining <= 0)
        {
            return 0;
        }

        if (remaining < count)
        {
            count = (int)remaining;
        }

        int read = _baseStream.Read(buffer, offset, count);
        _position += read;

        return read;
    }

    private void CheckDisposed()
    {
        if (_baseStream == null)
        {
            throw new ObjectDisposedException(GetType().Name);
        }
    }

    public override long Length
    {
        get
        {
            CheckDisposed();
            return _length;
        }
    }

    public override bool CanRead
    {
        get
        {
            CheckDisposed();
            return true;
        }
    }

    public override bool CanWrite
    {
        get
        {
            CheckDisposed();
            return false;
        }
    }

    public override bool CanSeek
    {
        get
        {
            CheckDisposed();
            return false;
        }
    }

    public override long Position
    {
        get
        {
            CheckDisposed();
            return _position;
        }
        set => throw new NotSupportedException();
    }

    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();

    public override void SetLength(long value) => throw new NotSupportedException();

    public override void Write(byte[] buffer, int offset, int count) => throw new NotImplementedException();

    public override void Flush()
    {
        CheckDisposed();
        _baseStream.Flush();
    }
}

The final version of the method is as follows.
private static void ExportAssets(CatalogFile catalogFile, string destDirectory)
{
    FileInfo catalogFileInfo = new FileInfo(catalogFile.FilePath);
    string catalogFileName = Path.GetFileNameWithoutExtension(catalogFileInfo.Name);
    string datFilePath = Path.Combine(catalogFileInfo.DirectoryName, $"{catalogFileName}.dat");
    FileInfo datFileInfo = new FileInfo(datFilePath);

    using Stream stream = datFileInfo.OpenRead();
    foreach (CatalogEntry catalogEntry in catalogFile.CatalogEntries)
    {
        string destFilePath = Path.Combine(destDirectory, catalogEntry.AssetPath);
        FileInfo destFile = new FileInfo(destFilePath);

        if (!destFile.Directory.Exists)
        {
            destFile.Directory.Create();
        }

        using var subStream = new SubStream(stream, catalogEntry.ByteOffset, catalogEntry.AssetSize);
        using Stream destStream = File.Open(destFile.FullName, FileMode.Create);
        subStream.CopyTo(destStream);
        destStream.Close();
    }
}
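To make the mechanics concrete, here is a self-contained toy version of the same idea (an editor-written, condensed sketch; `SubStreamLite` is not the class above — it is read-only, skips validation, and seeks from the start of the base stream rather than the current position) extracting one "asset" out of a larger blob:

```csharp
using System;
using System.IO;

// Condensed read-only window over a base stream, illustrating the SubStream idea.
class SubStreamLite : Stream
{
    private readonly Stream _base;
    private readonly long _length;
    private long _pos;

    public SubStreamLite(Stream baseStream, long offset, long length)
    {
        _base = baseStream;
        _length = length;
        _base.Seek(offset, SeekOrigin.Begin); // absolute offset, for simplicity
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        long remaining = _length - _pos;
        if (remaining <= 0) return 0;
        int read = _base.Read(buffer, offset, (int)Math.Min(count, remaining));
        _pos += read;
        return read;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => _length;
    public override long Position { get => _pos; set => throw new NotSupportedException(); }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

Extracting bytes 10..14 of a 100-byte blob then becomes `new SubStreamLite(blob, 10, 5).CopyTo(dest);` — CopyTo reuses one internal buffer for the entire copy, which is where the single-allocation behavior comes from.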

Benchmark setup

The setup I used while benchmarking:
  • I used two separate .dat files, one of 600 KB and the other of 550 MB.
  • In benchmarks, write operations to the file system cause the results to fluctuate, so I used a MemoryStream to simulate the write operations instead.
  • I included both synchronous and asynchronous versions of the methods in the benchmarks.
  • I am using the System.IO.Abstractions library to mock file I/O operations for unit testing, so don't be confused by the method calls starting with Fs. (e.g. Fs.FileInfo.FromFileName(catalogFile.FilePath)).

Three different versions of the method were benchmarked.

The first is the unoptimized version, which allocates a new byte[] for every sub-file in the .dat file.
private static void ExportAssetsUnoptimized(CatalogFile catalogFile, string destDirectory)
{
    IFileInfo catalogFileInfo = Fs.FileInfo.FromFileName(catalogFile.FilePath);
    string catalogFileName = Fs.Path.GetFileNameWithoutExtension(catalogFileInfo.Name);
    string datFilePath = Fs.Path.Combine(catalogFileInfo.DirectoryName, $"{catalogFileName}.dat");
    IFileInfo datFileInfo = Fs.FileInfo.FromFileName(datFilePath);

    using Stream stream = datFileInfo.OpenRead();

    foreach (CatalogEntry catalogEntry in catalogFile.CatalogEntries)
    {
        string destFilePath = Fs.Path.Combine(destDirectory, catalogEntry.AssetPath);
        IFileInfo destFile = Fs.FileInfo.FromFileName(destFilePath);

        if (!destFile.Directory.Exists)
        {
            // destFile.Directory.Create();
        }

        stream.Seek(catalogEntry.ByteOffset, SeekOrigin.Begin);
        var newFileData = new byte[catalogEntry.AssetSize];
        int read = stream.Read(newFileData, 0, catalogEntry.AssetSize);

        if (read != catalogEntry.AssetSize)
        {
            throw new DatFileReadException("Could not read asset data from dat file", datFilePath);
        }

        // using Stream destStream = Fs.File.Open(destFile.FullName, FileMode.Create);
        using var destStream = new MemoryStream();
        destStream.Write(newFileData);
        destStream.Close();
    }
}

The second uses ArrayPool from System.Buffers (this is what Mauricio Atanache suggested). ArrayPool&lt;T&gt; is a high-performance pool of managed arrays. You can find it in the System.Buffers package, and its source code is available on GitHub. It's mature and ready to use in production.

There is a good article that explains this subject in detail:

Pooling large arrays with ArrayPool

I still suspect that I am not using it correctly or for its intended purpose. But when I used it as shown below, I found that it ran faster than the unoptimized version above and allocated half as much memory.
private static void ExportAssetsWithArrayPool(CatalogFile catalogFile, string destDirectory)
{
    IFileInfo catalogFileInfo = Fs.FileInfo.FromFileName(catalogFile.FilePath);
    string catalogFileName = Fs.Path.GetFileNameWithoutExtension(catalogFileInfo.Name);
    string datFilePath = Fs.Path.Combine(catalogFileInfo.DirectoryName, $"{catalogFileName}.dat");
    IFileInfo datFileInfo = Fs.FileInfo.FromFileName(datFilePath);

    ArrayPool<byte> bufferPool = ArrayPool<byte>.Shared;

    using Stream stream = datFileInfo.OpenRead();
    foreach (CatalogEntry catalogEntry in catalogFile.CatalogEntries)
    {
        string destFilePath = Fs.Path.Combine(destDirectory, catalogEntry.AssetPath);
        IFileInfo destFile = Fs.FileInfo.FromFileName(destFilePath);

        if (!destFile.Directory.Exists)
        {
            //destFile.Directory.Create();
        }

        stream.Seek(catalogEntry.ByteOffset, SeekOrigin.Begin);
        byte[] newFileData = bufferPool.Rent(catalogEntry.AssetSize);
        int read = stream.Read(newFileData, 0, catalogEntry.AssetSize);

        if (read != catalogEntry.AssetSize)
        {
            throw new DatFileReadException("Could not read asset data from dat file", datFilePath);
        }

        // using Stream destStream = Fs.File.Open(destFile.FullName, FileMode.Create);
        using Stream destStream = new MemoryStream();
        destStream.Write(newFileData, 0, catalogEntry.AssetSize);
        destStream.Close();
        bufferPool.Return(newFileData);
    }
}
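One subtlety worth noting when using ArrayPool (an editor's note, not part of the original answer): `Rent` only guarantees an array of *at least* the requested size, and usually returns a larger, possibly dirty one. That is why the code above passes the explicit `catalogEntry.AssetSize` count to `Write` rather than writing the whole rented array:

```csharp
using System;
using System.Buffers;

static class RentDemo
{
    public static void Main()
    {
        ArrayPool<byte> pool = ArrayPool<byte>.Shared;
        byte[] buffer = pool.Rent(1000);
        try
        {
            // At least 1000 bytes; typically rounded up (e.g. to 1024).
            Console.WriteLine(buffer.Length >= 1000);

            // Only operate on the slice you actually asked for.
            Span<byte> window = buffer.AsSpan(0, 1000);
            window.Fill(0xAB);
        }
        finally
        {
            // Returning lets the pool reuse the array; contents are not
            // cleared unless you pass clearArray: true.
            pool.Return(buffer);
        }
    }
}
```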

The third is the fastest version with the least memory allocation. By least memory allocation, I mean roughly 75 times less allocated memory, and it is significantly faster.

I already gave the code example and explanation of this method at the beginning of the answer, so I will skip straight to the benchmark results.

You can access the full BenchmarkDotNet setup from the gist link below.

https://gist.github.com/Blind-Striker/8f7e8ff56de6d9c2a4ab7a47ae423eba


Benchmark results

| Method | FileSize | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| ExportAssetsUnoptimized_Benchmark | Large_5GB | 563,034.4 us | 13,290.13 us | 38,977.64 us | 140000.0000 | 140000.0000 | 140000.0000 | 1,110,966 KB |
| ExportAssetsWithArrayPool_Benchmark | Large_5GB | 270,394.1 us | 5,308.29 us | 6,319.15 us | 5500.0000 | 4000.0000 | 4000.0000 | 555,960 KB |
| ExportAssetsSubStream_Benchmark | Large_5GB | 17,525.8 us | 183.55 us | 171.69 us | 3468.7500 | 3468.7500 | 3468.7500 | 14,494 KB |
| ExportAssetsUnoptimizedAsync_Benchmark | Large_5GB | 574,430.4 us | 20,442.46 us | 59,954.20 us | 133000.0000 | 133000.0000 | 133000.0000 | 1,111,298 KB |
| ExportAssetsWithArrayPoolAsync_Benchmark | Large_5GB | 237,256.6 us | 5,673.63 us | 16,728.82 us | 1500.0000 | - | - | 556,088 KB |
| ExportAssetsSubStreamAsync_Benchmark | Large_5GB | 32,766.5 us | 636.08 us | 732.51 us | 3187.5000 | 2562.5000 | 2562.5000 | 15,186 KB |
| ExportAssetsUnoptimized_Benchmark | Small_600KB | 680.4 us | 13.24 us | 23.20 us | 166.0156 | 124.0234 | 124.0234 | 1,198 KB |
| ExportAssetsWithArrayPool_Benchmark | Small_600KB | 497.9 us | 7.54 us | 7.06 us | 124.5117 | 62.0117 | 62.0117 | 605 KB |
| ExportAssetsSubStream_Benchmark | Small_600KB | 332.0 us | 4.87 us | 4.32 us | 26.8555 | 26.8555 | 26.8555 | 223 KB |
| ExportAssetsUnoptimizedAsync_Benchmark | Small_600KB | 739.2 us | 5.98 us | 5.30 us | 186.5234 | 124.0234 | 124.0234 | 1,200 KB |
| ExportAssetsWithArrayPoolAsync_Benchmark | Small_600KB | 604.9 us | 6.99 us | 6.54 us | 124.0234 | 61.5234 | 61.5234 | 607 KB |
| ExportAssetsSubStreamAsync_Benchmark | Small_600KB | 496.6 us | 8.02 us | 6.70 us | 26.8555 | 26.8555 | 26.8555 | 228 KB |

Conclusion and disclaimer

My conclusion is that the SubStream and Stream.CopyTo approach allocates far less memory and runs much faster. Some of the remaining allocations are probably caused by Path.Combine.

However, I would like to remind you that I had never used ArrayPool before I posted this question on Stack Overflow. I may not be using it correctly or for its intended purpose. I am also not sure how accurate it is to use a MemoryStream instead of a FileStream as the write target to keep the benchmarks consistent.