I wrote a function that merges two large files (file1, file2) into a new file (outputFile). Each file has a line-based format, and entries are separated by a \0 byte. Both files contain the same number of NUL bytes.
A sample file with two entries might look like this: A\nB\n\0C\nZ\nB\n\0
Input:
file1: A\nB\0C\nZ\nB\n\0
file2: BBA\nAB\0T\nASDF\nQ\n\0
Output:
outputFile: A\nB\nBBA\nAB\0C\nZ\nB\nT\nASDF\nQ\n\0
FILE *outputFile = fopen(...);
setvbuf(outputFile, NULL, _IOFBF, 1024*1024*1024);
FILE *file1 = fopen(...);
FILE *file2 = fopen(...);

int c1, c2;
while ((c1 = fgetc(file1)) != EOF) {
    if (c1 == '\0') {
        while ((c2 = fgetc(file2)) != EOF && c2 != '\0') {
            fwrite(&c2, sizeof(char), 1, outputFile);
        }
        char nullByte = '\0';
        fwrite(&nullByte, sizeof(char), 1, outputFile);
    } else {
        fwrite(&c1, sizeof(char), 1, outputFile);
    }
}
Is there any way to improve the IO performance of this function? I already increased the buffer size of outputFile to 1 GB using setvbuf. Would using posix_fadvise on file1 and file2 help?
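As a minimal sketch of what posix_fadvise would look like here: POSIX_FADV_SEQUENTIAL tells the kernel you will scan the descriptor front to back, so it can read ahead more aggressively. The helper name advise_sequential is hypothetical, not part of any API; with stdio streams you would obtain the descriptor via fileno(file1).

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hint sequential access for the whole file: offset 0 and len 0
   cover everything from the start to the end of the file.
   Returns 0 on success, an errno value on failure. */
int advise_sequential(int fd)
{
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}
```

Whether this helps depends on the kernel; on Linux it mainly increases the read-ahead window, which matters far less than avoiding character-at-a-time IO in the first place.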
You are doing character-at-a-time IO. Even with buffered streams, that is going to be needlessly and painfully slow.
Take advantage of the fact that your data is stored in the files as NUL-terminated strings.
Assuming you alternate NUL-terminated strings from each file, and you are running on a POSIX platform, you can simply mmap() the input files:
typedef struct mapdata
{
    const char *ptr;
    size_t bytes;
} mapdata_t;

mapdata_t mapFile( const char *filename )
{
    mapdata_t data;
    struct stat sb;
    int fd = open( filename, O_RDONLY );
    fstat( fd, &sb );
    data.bytes = sb.st_size;
    /* assumes we have a NUL byte after the file data
       If the size of the file is an exact multiple of the
       page size, we won't have the terminating NUL byte! */
    data.ptr = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    close( fd );
    return( data );
}

void unmapFile( mapdata_t data )
{
    /* cast away const - munmap() takes a plain void pointer */
    munmap( ( void * ) data.ptr, data.bytes );
}
void mergeFiles( const char *file1, const char *file2, const char *output )
{
    char zeroByte = '\0';
    mapdata_t data1 = mapFile( file1 );
    mapdata_t data2 = mapFile( file2 );
    size_t strOffset1 = 0UL;
    size_t strOffset2 = 0UL;

    /* get a page-aligned buffer - a 64kB alignment should work */
    char *iobuffer = memalign( 64UL * 1024UL, 1024UL * 1024UL );

    /* memset the buffer to ensure the virtual mappings exist */
    memset( iobuffer, 0, 1024UL * 1024UL );

    /* use of direct IO should reduce memory pressure - the 1 MB
       buffer is already pretty large, and since we're not seeking
       the page cache is really only slowing things down */
    int fd = open( output, O_RDWR | O_TRUNC | O_CREAT | O_DIRECT, 0644 );
    FILE *outputfile = fdopen( fd, "wb" );
    setvbuf( outputfile, iobuffer, _IOFBF, 1024UL * 1024UL );

    /* loop until we reach the end of either mapped file */
    for ( ;; )
    {
        fputs( data1.ptr + strOffset1, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );
        fputs( data2.ptr + strOffset2, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );

        /* skip over the string, assuming there's one NUL
           byte in between strings */
        strOffset1 += 1 + strlen( data1.ptr + strOffset1 );
        strOffset2 += 1 + strlen( data2.ptr + strOffset2 );

        /* if either offset is too big, end the loop */
        if ( ( strOffset1 >= data1.bytes ) ||
             ( strOffset2 >= data2.bytes ) )
        {
            break;
        }
    }

    fclose( outputfile );
    unmapFile( data1 );
    unmapFile( data2 );
}
I did no error checking at all. You will also need to add the proper header files.
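For reference, a hedged sketch of what mapFile() looks like with the omitted error checking filled in. The name mapFileChecked is hypothetical; this version simply exits on failure for brevity, where real code would propagate the error instead:

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct mapdata
{
    const char *ptr;
    size_t bytes;
} mapdata_t;

/* Same as mapFile() above, but every system call is checked. */
mapdata_t mapFileChecked( const char *filename )
{
    mapdata_t data = { NULL, 0 };
    struct stat sb;
    int fd = open( filename, O_RDONLY );
    if ( fd < 0 ) {
        perror( filename );
        exit( EXIT_FAILURE );
    }
    if ( fstat( fd, &sb ) < 0 ) {
        perror( "fstat" );
        close( fd );
        exit( EXIT_FAILURE );
    }
    data.bytes = ( size_t ) sb.st_size;
    void *p = mmap( NULL, data.bytes, PROT_READ, MAP_PRIVATE, fd, 0 );
    close( fd );
    if ( p == MAP_FAILED ) {
        perror( "mmap" );
        exit( EXIT_FAILURE );
    }
    data.ptr = p;
    return data;
}
```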
Also note that this assumes the file data is not an exact multiple of the system page size, ensuring that a NUL byte is mapped after the file contents. If the size of a file is an exact multiple of the page size, you will have to mmap() an additional page after the file contents to guarantee there is a NUL byte to terminate the last string.
Or you can rely on a NUL byte being the last byte of the file's contents. If that turns out not to be true, you will likely get either a SEGV or corrupted data.
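One way to implement the "additional page" variant is to reserve an anonymous zero-filled region one byte longer than the file, then map the file over the front of it with MAP_FIXED. Even when the file size is an exact multiple of the page size, the byte just past the data lands in the trailing anonymous page, which is guaranteed to be zero. This is a sketch under assumptions: MAP_ANONYMOUS is a Linux/BSD extension rather than strict POSIX, mapWithNul is a hypothetical helper name, and error checking is again omitted.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stddef.h>

/* Map a file so the byte after its contents is always a readable
   NUL, regardless of how the file size relates to the page size. */
const char *mapWithNul( const char *filename, size_t *bytes )
{
    struct stat sb;
    int fd = open( filename, O_RDONLY );
    fstat( fd, &sb );
    *bytes = ( size_t ) sb.st_size;

    /* anonymous mappings are rounded up to whole pages and
       zero-filled, so this reserves at least one zero byte
       past the end of the file data */
    void *base = mmap( NULL, *bytes + 1, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );

    /* overlay the file contents on the front of the reservation */
    mmap( base, *bytes, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0 );
    close( fd );
    return base;
}
```

Remember to munmap() the full *bytes + 1 length when you are done with the mapping.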