mmap vs malloc: strange performance

Asked by spr*_*121 · score 6 · tags: linux, memory, performance, multithreading, d

I'm writing some code that parses log files, the caveat being that the files are compressed and have to be decompressed on the fly. This is a performance-sensitive piece of code, so I'm trying out different approaches to find one that fits. Memory is not a constraint: no matter how many threads I use, I essentially have as much memory as the program needs.

I found one approach that seems to perform quite well, and I'm trying to understand why it gives better performance.

Both approaches use a reader thread that reads from a piped gzip process and writes into a large buffer. That buffer is then lazily parsed when the next log line is requested, returning a struct that is essentially a set of pointers to where the different fields sit in the buffer.

The code is in D, but it is very similar to C or C++.

Shared variables:

shared(bool) _stream_empty = false;
shared(ulong) upper_bound = 0;
shared(ulong) curr_index = 0;

Parsing code:

//Lazily parse the buffer
void construct_next_elem() {

    while (1) {
        // Spin so we do not get ahead of the reader thread
        bool buffer_empty = curr_index >= upper_bound - 1 &&
                            _stream_empty;
        if (curr_index >= upper_bound && !_stream_empty) {
            continue;
        }
        // Parsing logic .....
    }
}

Approach 1: use a buffer large enough to hold the decompressed file up front.

char[] buffer;                   // Same as vector<char> in C++
buffer.length = buffer_length;   // Same as vector reserve in C++ or malloc
Run Code Online (Sandbox Code Playgroud)

Approach 2: use an anonymous memory mapping as the buffer.

MmFile buffer;
buffer = new MmFile(null,
                    MmFile.Mode.readWrite, // PROT_READ | PROT_WRITE
                    buffer_length,
                    null);                 // MAP_ANON | MAP_PRIVATE

Reader thread:

ulong buffer_length = get_gzip_length(file_path);
pipe = pipeProcess(["gunzip", "-c", file_path],
                                   Redirect.stdout);
stream = pipe.stdout();

static void stream_data() {
    while (!stream.eof()) {

        // slice is a reference into the buffer, no copy
        char[] slice = buffer[upper_bound .. upper_bound + READ_SIZE];
        ulong read = stream.rawRead(slice).length;
        upper_bound += read;
    }
    }
    // Clean up
}

void start_stream() {
    auto t = task!stream_data();
    t.executeInNewThread();
    construct_next_elem();
}

I get significantly better performance from approach 1, approaching an order of magnitude.

Approach 1 (plain buffer):

User time (seconds): 112.22
System time (seconds): 38.56
Percent of CPU this job got: 151%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:39.40
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3784992
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 5463
Voluntary context switches: 90707
Involuntary context switches: 2838
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Approach 2 (MmFile):

User time (seconds): 275.92
System time (seconds): 73.92
Percent of CPU this job got: 117%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:58.73
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3777336
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 944779
Voluntary context switches: 89305
Involuntary context switches: 9836
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Approach 2 incurs far more page faults.

Can someone shed some light on why using mmap causes such a noticeable drop in performance here?

If anyone knows of a better way to tackle this problem, I'd love to hear it.

EDIT ----

Changed approach 2 to do:

char* buffer = cast(char*)mmap(cast(void*)null,
                               buffer_length,
                               PROT_READ | PROT_WRITE,
                               MAP_ANON | MAP_PRIVATE,
                               -1,
                               0);

This now gives a 3x performance improvement over the plain MmFile. I'm trying to work out what could make the performance so different, given that MmFile is essentially just a wrapper around mmap.

Perf numbers for the raw char* mmap vs. MmFile; page faults are way down:

User time (seconds): 109.99
System time (seconds): 36.11
Percent of CPU this job got: 151%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:36.20
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3777896
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2771
Voluntary context switches: 90827
Involuntary context switches: 2999
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Answer by Sor*_*ren · score 0

You're seeing the page faults and the slowdown because, by default, mmap only loads a page the first time you try to access it.

read, on the other hand, knows you're reading sequentially, so it fetches pages ahead of your requests.

Take a look at the madvise call. Its purpose is to signal to the kernel how you intend to access the mmap'ed memory, and it lets you set different policies for different parts of the mapping: for example, an index block you want kept resident [MADV_WILLNEED], content that is accessed randomly on demand [MADV_RANDOM], or memory you walk through in a sequential scan [MADV_SEQUENTIAL].

However, the OS is entirely free to ignore the policies you set, so YMMV.