Fastest MATLAB file reading?

use*_*403 39 file-io matlab

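My MATLAB program reads a file of about 7 million lines and wastes far too much time on I/O. I know that each line is formatted as two integers, but I don't know exactly how many characters they take up. str2num is painfully slow; which MATLAB function should I use instead?

The catch: I have to operate on one line at a time without holding the whole file in memory, so none of the commands that read an entire matrix are on the table.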

fid = fopen('file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);    
    %do stuff with nums
    tline = fgetl(fid);
end
fclose(fid);

Pur*_*uit 61

Problem statement

This is a common struggle, and nothing answers it like a test. Here are my assumptions:

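  1. A well-formatted ASCII file containing two columns of numbers. No headers, no inconsistent lines, etc.

  2. The method must scale to files that are too large to fit in memory (although my patience is limited, so my test file has only 500,000 lines).

  3. The actual operation (what the OP calls "do stuff with nums") must be performed one row at a time and cannot be vectorized.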

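Discussion

With that in mind, the answers and comments seem to encourage efficiency in three areas:

  • Reading the file in larger batches
  • Performing the string-to-number conversion more efficiently (either by batching or by using better functions)
  • Making the actual processing more efficient (which I have ruled out via rule 3 above).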

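Results

I put together a quick script to test the ingestion speed (and the consistency of the result) of several variations on these themes. The results are:

  • Initial code. 68.23 sec. 582582 check
  • Using sscanf, once per line. 27.20 sec. 582582 check
  • Using fscanf in large batches. 8.93 sec. 582582 check
  • Using textscan in large batches. 8.79 sec. 582582 check
  • Reading large batches into memory, then sscanf. 8.15 sec. 582582 check
  • Using Java single line file reader and sscanf on single lines. 63.56 sec. 582582 check
  • Using Java single item token scanner. 81.19 sec. 582582 check
  • Fully batched operations (non-compliant). 1.02 sec. 508680 check (violates rule 3)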

Summary

More than half of the original time (68 -> 27 sec) was consumed by inefficiencies in the str2num call, and can be removed by switching to sscanf.
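To make the swap concrete, here is a minimal sketch on a single line of text. The sample string is made up for illustration; the format string matches the test file used in this answer. Note the difference in output shape: str2num returns a row vector here, while sscanf returns a column vector, which matters if the per-line code indexes into nums.

```matlab
% Illustrative input line (an assumption, not taken from the OP's file)
tline = '12345, 678 ';

% Original approach: general-purpose, but slow when called once per line
numsSlow = str2num(tline);            % 1x2 row vector

% Replacement: parse with an explicit format string
numsFast = sscanf(tline, '%d, %d');   % 2x1 column vector

% Both hold the same two values, just in different orientations
sameValues = isequal(numsSlow(:), numsFast);
```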

Roughly another 2/3 of the remaining time (27 -> 8 sec) can be saved by using larger batches for both file reading and string-to-number conversion.
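The batching idea can be sketched on an in-memory string: one sscanf call converts many lines at once, and the per-row work then runs on the numeric matrix. The data values and the rowMeans variable are illustrative assumptions; the per-row operation stands in for whatever "do stuff with nums" is.

```matlab
% Build a small multi-line string in the same "a, b \n" layout as the test file
data = sprintf('%d, %d \n', [10 20; 30 40; 50 60]');

% One conversion call for the whole batch; the format is reused until input ends
batch = reshape(sscanf(data, '%d, %d'), 2, [])';   % one row per input line

% Rule 3 still holds: the actual work is done one row at a time
rowMeans = zeros(size(batch, 1), 1);
for ix = 1:size(batch, 1)
    nums = batch(ix, :);
    rowMeans(ix) = mean(nums);
end
```

Only the parsing is batched here, so the line-by-line constraint on the processing step is untouched.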

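If we are willing to violate rule 3 from the original post, another 7/8 of the time can be saved by switching to fully numeric (vectorized) processing. However, some algorithms do not lend themselves to this, so we leave it alone. (Note that the "check" value does not match for the last entry.)

Finally, in direct contradiction to a previous edit of mine in this answer, no savings are available from switching to the cached Java single-line readers. In fact, that solution is 2 to 3 times slower than the comparable single-line result using the native readers (63 vs. 27 seconds).

Sample code for all of the solutions described above is included below.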


Sample code

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Create a test file
cd(tempdir);
fName = 'demo_file.txt';
fid = fopen(fName,'w');
for ixLoop = 1:5
    d = randi(1e6, 1e5,2);
    fprintf(fid, '%d, %d \n',d);
end
fclose(fid);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Initial code
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Initial code.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using sscanf, once per line
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = sscanf(tline,'%d, %d');
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Using sscanf, once per line.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using fscanf in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
while ~isempty(scannedData)
    for ix = 1:size(scannedData,1)
        nums = scannedData(ix,:);
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
end
fclose(fid);
t = toc;
fprintf(1,'Using fscanf in large batches.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using textscan in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
while ~isempty(scannedData{1})
    for ix = 1:size(scannedData{1},1)
        nums = [scannedData{1}(ix) scannedData{2}(ix)];
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
end
fclose(fid);
t = toc;
fprintf(1,'Using textscan in large batches.  %3.2f sec.  %d check \n', t, CHECK);



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, incrementing to end-of-line, sscanf
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');

dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
    dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
end
data = [dataBatch dataIncrement];

while ~isempty(data)
    scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
    for ix = 1:size(scannedData,1)
        nums = scannedData(ix,:);
        CHECK = round((CHECK + mean(nums) ) /2);
    end

    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Reading large batches into memory, then sscanf.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java single line readers + sscanf
CHECK = 0;
tic;
bufferSize = 1e4;
reader =  java.io.LineNumberReader(java.io.FileReader('demo_file.txt'),bufferSize );
tline = char(reader.readLine());
while ~isempty(tline)
    nums = sscanf(tline,'%d, %d');
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = char(reader.readLine());
end
reader.close();
t = toc;
fprintf(1,'Using java single line file reader and sscanf on single lines.  %3.2f sec.  %d check \n', t, CHECK);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java scanner for file reading and string conversion
CHECK = 0;
tic;
jFile = java.io.File('demo_file.txt');
scanner = java.util.Scanner(jFile);
scanner.useDelimiter('[\s\,\n\r]+');
while scanner.hasNextInt()
    nums = [scanner.nextInt() scanner.nextInt()];
    CHECK = round((CHECK + mean(nums) ) /2);
end
scanner.close();
t = toc;
fprintf(1,'Using java single item token scanner.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, vectorized operations (non-compliant solution)
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');

dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
    dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
end
data = [dataBatch dataIncrement];

while ~isempty(data)
    scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
    CHECK = round((CHECK + mean(scannedData(:)) ) /2);

    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Fully batched operations.  %3.2f sec.  %d check \n', t, CHECK);

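(Original answer)

To expand on Ben's point: if you read these files line by line, your bottleneck will always be file I/O.

I understand that sometimes you cannot fit the whole file into memory. I typically read in a large batch of characters (1e5, 1e6 or thereabouts, depending on the memory of your system). Then I either read additional single characters (or back off single characters) to get a whole number of lines, and then run the string parsing (e.g. sscanf).

Then, if you want, you can process the resulting large matrix one row at a time, before repeating the process until you reach the end of the file.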
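A rough sketch of the "back off" variant follows, under some assumptions: the demo file name and the tiny chunk size are made up so the chunk boundaries are easy to follow, and rowsSeen exists only so the result can be inspected. It reads a fixed-size chunk, rewinds the file position to just after the last newline in that chunk, parses the complete lines, and then handles them one row at a time.

```matlab
% Hypothetical demo file (name and contents are illustrative assumptions)
fName = fullfile(tempdir, 'chunk_demo.txt');
fid = fopen(fName, 'w');
fprintf(fid, '%d, %d \n', [1 2; 3 4; 5 6; 7 8]');
fclose(fid);

chunkSize = 10;          % tiny on purpose; something like 1e5 or 1e6 in practice
eol = char(10);          % newline character
rowsSeen = zeros(0, 2);  % kept only to inspect the result

fid = fopen(fName, 'r');
while true
    chunk = fread(fid, chunkSize, 'uint8=>char')';
    if isempty(chunk)
        break;                               % end of file
    end
    lastEol = find(chunk == eol, 1, 'last');
    if isempty(lastEol)
        % No complete line in this chunk: extend to the end of the line
        rest = fgets(fid);
        if ischar(rest)
            chunk = [chunk rest];
        end
    elseif lastEol < numel(chunk)
        % Back off: rewind to just after the last newline, drop the partial line
        fseek(fid, lastEol - numel(chunk), 'cof');
        chunk = chunk(1:lastEol);
    end
    batch = reshape(sscanf(chunk, '%d, %d'), 2, [])';
    for ix = 1:size(batch, 1)
        nums = batch(ix, :);                 % per-line processing goes here
        rowsSeen(end+1, :) = nums;
    end
end
fclose(fid);
```

Compared with reading extra characters one at a time until a newline shows up, backing off costs a single fseek per chunk, at the price of re-reading the partial line on the next pass.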

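It's a little tedious, but not that hard. I typically see an improvement of more than 90% in speed over single-line readers.


(Terrible idea of using Java batched line readers removed in shame)

  • Did you test this Java stuff? MATLAB's fopen I/O is already buffered, just like C's stdio; switching to Java classes will only add overhead. For me it was 4x slower than the OP's original fgetl. The overhead is probably not the disk I/O itself, but the cost of running loop operations over small chunks of data. (2 upvotes)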