Improving IO performance when merging two files in C

mar*_*n s 6 c io performance

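I wrote a function that merges two large files (file1, file2) into a new file (outputFile). Each file is in a line-based format, and entries are separated by \0 bytes. Both files contain the same number of NUL bytes.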

An example file with two entries might look like this: A\nB\n\0C\nZ\nB\n\0

   Input:
   file1: A\nB\n\0C\nZ\nB\n\0
   file2: BBA\nAB\0T\nASDF\nQ\n\0
   Output:
   outputFile: A\nB\nBBA\nAB\0C\nZ\nB\nT\nASDF\nQ\n\0

FILE * outputFile = fopen(...);
setvbuf(outputFile, NULL, _IOFBF, 1024*1024*1024);
FILE * file1 = fopen(...);
FILE * file2 = fopen(...);
int c1, c2;
while ((c1 = fgetc(file1)) != EOF) {
    if (c1 == '\0') {
        /* end of an entry in file1: append the matching entry from file2 */
        while ((c2 = fgetc(file2)) != EOF && c2 != '\0') {
            /* fputc avoids writing the first byte of an int, which is
               only correct on little-endian machines */
            fputc(c2, outputFile);
        }
        fputc('\0', outputFile);
    } else {
        fputc(c1, outputFile);
    }
}

Is there a way to improve the IO performance of this function? I increased the buffer size of outputFile to 1 GB using setvbuf. Would using posix_fadvise on file1 and file2 help?
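For reference, this is roughly how the posix_fadvise hint mentioned above could be issued. It is a minimal sketch assuming Linux or another system that provides posix_fadvise; the helper name open_sequential is hypothetical. The call is only advisory, so whether it helps has to be measured.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for posix_fadvise() and fdopen() */
#endif
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: open a file for reading and hint to the kernel
   that it will be read sequentially from start to end. */
FILE *open_sequential(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    /* offset 0 with len 0 means "the whole file". The return value is an
       error number; it is ignored here because this is only a hint. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    FILE *fp = fdopen(fd, "rb");
    if (fp == NULL)
        close(fd);
    return fp;
}
```

file1 and file2 would then be opened with open_sequential(...) instead of fopen(...).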

And*_*nle 1

You are doing IO character by character. Even with buffered streams, that is going to be needlessly and painfully slow.

Take advantage of the fact that your data is stored in the files as NUL-terminated strings.
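As one illustration of entry-at-a-time IO (not the approach of the code in this answer), here is a minimal sketch using POSIX getdelim() with '\0' as the delimiter. The function name merge_entries is hypothetical, and the output layout (the entry from the first file, the entry from the second file, then a single NUL) is assumed from the example in the question.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for getdelim() */
#endif
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: merge NUL-delimited entries from in1 and in2.
   Each output entry is entry1 followed by entry2 and one NUL byte.
   Returns 0 on success, -1 on a write error. */
int merge_entries(FILE *in1, FILE *in2, FILE *out)
{
    char *buf1 = NULL, *buf2 = NULL;
    size_t cap1 = 0, cap2 = 0;
    ssize_t n1, n2;
    int rc = 0;

    /* getdelim() returns the number of bytes read, delimiter included. */
    while ((n1 = getdelim(&buf1, &cap1, '\0', in1)) > 0) {
        int terminated = (buf1[n1 - 1] == '\0');
        size_t len1 = (size_t)n1 - (terminated ? 1 : 0);

        if (fwrite(buf1, 1, len1, out) != len1) { rc = -1; break; }
        if (!terminated)
            break;   /* trailing data without a NUL: copy it and stop */

        n2 = getdelim(&buf2, &cap2, '\0', in2);
        if (n2 > 0) {
            size_t len2 = (size_t)n2 - (buf2[n2 - 1] == '\0' ? 1 : 0);
            if (fwrite(buf2, 1, len2, out) != len2) { rc = -1; break; }
        }
        if (fputc('\0', out) == EOF) { rc = -1; break; }
    }

    free(buf1);
    free(buf2);
    return rc;
}
```

Each entry is read and written with one library call instead of one call per byte; the buffers grow as needed, so entry length does not have to be known in advance.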

Assuming you are alternating NUL-terminated strings from each file, and you are running on a POSIX platform so that you can simply mmap() the input files:

#define _GNU_SOURCE   /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct mapdata
{
    const char *ptr;
    size_t bytes;
} mapdata_t;

mapdata_t mapFile( const char *filename )
{
    mapdata_t data;
    struct stat sb;

    int fd = open( filename, O_RDONLY );
    fstat( fd, &sb );

    data.bytes = sb.st_size;

    /* assumes we have a NUL byte after the file data
       If the size of the file is an exact multiple of the
       page size, we won't have the terminating NUL byte! */
    data.ptr = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    close( fd );
    return( data );
}

void unmapFile( mapdata_t data )
{
    /* munmap() takes a non-const void *, so cast away the const */
    munmap( ( void * ) data.ptr, data.bytes );
}

void mergeFiles( const char *file1, const char *file2, const char *output )
{
    char zeroByte = '\0';

    mapdata_t data1 = mapFile( file1 );
    mapdata_t data2 = mapFile( file2 );

    size_t strOffset1 = 0UL;
    size_t strOffset2 = 0UL;

    /* get a page-aligned buffer - a 64kB alignment should work
       (aligned_alloc() replaces the obsolete memalign()) */
    char *iobuffer = aligned_alloc( 64UL * 1024UL, 1024UL * 1024UL );

    /* memset the buffer to ensure the virtual mappings exist */
    memset( iobuffer, 0, 1024UL * 1024UL );

    /* use of direct IO should reduce memory pressure - the 1 MB
       buffer is already pretty large, and since we're not seeking
       the page cache is really only slowing things down.
       Caveat: O_DIRECT typically requires block-aligned write sizes,
       so the final partial flush may fail with EINVAL on some file
       systems; drop O_DIRECT if that happens */
    int fd = open( output, O_RDWR | O_TRUNC | O_CREAT | O_DIRECT, 0644 );

    FILE *outputfile = fdopen( fd, "wb" );
    setvbuf( outputfile, iobuffer, _IOFBF, 1024UL * 1024UL );

    /* loop until we reach the end of either mapped file */
    while ( ( strOffset1 < data1.bytes ) &&
            ( strOffset2 < data2.bytes ) )
    {
        /* measure each string once and reuse the length */
        size_t len1 = strlen( data1.ptr + strOffset1 );
        size_t len2 = strlen( data2.ptr + strOffset2 );

        fwrite( data1.ptr + strOffset1, 1, len1, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );

        fwrite( data2.ptr + strOffset2, 1, len2, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );

        /* skip over the string, assuming there's one NUL
           byte in between strings */
        strOffset1 += 1 + len1;
        strOffset2 += 1 + len2;
    }

    /* the stream must be closed before its buffer is freed */
    fclose( outputfile );
    free( iobuffer );

    unmapFile( data1 );
    unmapFile( data2 );
}

I haven't done any error checking at all. Also make sure the proper header files are included.

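Also note that this assumes the file data is not an exact multiple of the system page size, which ensures that a NUL byte is mapped after the file contents. If the size of the file is an exact multiple of the page size, you will have to mmap() an additional page after the file contents to ensure there is a NUL byte to terminate the last string.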
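One way to get that extra page, sketched under the assumption of a Linux-style mmap() with MAP_ANONYMOUS and MAP_FIXED: first reserve a zero-filled anonymous region one page longer than the page-rounded file size, then map the file over the start of it. Simply asking mmap() for a longer file mapping is not enough, because touching a page that lies wholly beyond the end of the file raises SIGBUS. The helper name map_with_nul is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a file read-only so that at least one zero byte
   always follows the file contents. *file_len receives the file size and
   *map_len the length to pass to munmap(). Returns NULL on failure or for
   an empty file. */
const char *map_with_nul(const char *path, size_t *file_len, size_t *map_len)
{
    const char *result = NULL;
    struct stat sb;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    if (fstat(fd, &sb) == 0 && sb.st_size > 0) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t size = (size_t)sb.st_size;
        /* round up to whole pages, then add one spare page */
        size_t total = (size + page - 1) / page * page + page;

        /* 1. reserve the whole range as zero-filled anonymous memory */
        void *base = mmap(NULL, total, PROT_READ,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base != MAP_FAILED) {
            /* 2. overlay the file at the start of the reservation; the
                  spare anonymous page after it stays zero-filled */
            if (mmap(base, size, PROT_READ,
                     MAP_PRIVATE | MAP_FIXED, fd, 0) != MAP_FAILED) {
                result = base;
                *file_len = size;
                *map_len = total;
            } else {
                munmap(base, total);
            }
        }
    }
    close(fd);
    return result;
}
```

With this layout, strlen() on the last string stops inside the mapping even when the file size is an exact multiple of the page size.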

Or you can rely on a NUL byte being the last byte of the file contents. If that turns out not to be the case, you will likely get a SEGV or corrupted data.
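A third option, swapped in here as an alternative to strlen() rather than something the code above does, is to bound every scan by the mapped size with memchr(). The sketch below never reads past the file data, so it needs neither a trailing NUL nor an extra page. The helper name write_next_entry is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the entry starting at data[*offset] (without
   its terminating NUL) to out, then advance *offset past the entry and its
   NUL. Never reads beyond data[size - 1].
   Returns 1 if an entry was written, 0 at end of data, -1 on write error. */
int write_next_entry(const char *data, size_t size, size_t *offset, FILE *out)
{
    if (*offset >= size)
        return 0;

    const char *start = data + *offset;
    size_t remaining = size - *offset;

    /* search for the NUL only within the bytes that really exist */
    const char *nul = memchr(start, '\0', remaining);
    size_t len = nul ? (size_t)(nul - start) : remaining;

    if (fwrite(start, 1, len, out) != len)
        return -1;

    *offset += len + (nul ? 1 : 0);
    return 1;
}
```

Calling it once per input file inside the merge loop, followed by writing the separator byte, would give the same output as the strlen()-based loop when the files are well formed.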