我的MATLAB程序正在读取一个大约7米行的文件,并且在I/O上浪费了太多时间.我知道每一行都被格式化为两个整数,但我不确切知道他们占用了多少个字符.str2num是死的慢,我应该使用什么matlab函数?
Catch:我必须一次操作一行,而不存储整个文件内存,所以没有读取整个矩阵的命令都在桌面上.
fid = fopen('file.txt');
tline = fgetl(fid);
while ischar(tline)
nums = str2num(tline);
%do stuff with nums
tline = fgetl(fid);
end
fclose(fid);
Run Code Online (Sandbox Code Playgroud)
Pur*_*uit 61
这是一个共同的斗争,没有什么比测试更能回答了.以下是我的假设:
格式良好的ASCII文件,包含两列数字.没有标题,没有不一致的行等.
该方法必须扩展为读取太大而无法包含在内存中的文件(尽管我的耐心有限,因此我的测试文件只有500,000行).
实际操作(OP调用"用nums做什么")必须一次执行一行,不能进行矢量化.
考虑到这一点,答案和评论似乎在三个方面鼓励效率:
我整理了一个快速脚本来测试这些主题的6种变体的摄取速度(以及结果的一致性).结果是:
原始时间的一半以上(68-> 27秒)在str2num调用中效率低下,可以通过切换sscanf来消除.
通过使用较大的批次进行文件读取和字符串到数字转换,可以减少剩余时间的另外2/3(27 - > 8秒).
如果我们愿意违反原始帖子中的第3条规则,则可以通过切换到完全数字处理来减少另外7/8的时间.但是,有些算法不适用于此,所以我们不管它.(不是"检查"值与最后一个条目不匹配.)
最后,与此响应中的上一次编辑直接矛盾,通过切换可用的缓存Java单行读取器,无法节省成本.实际上,该解决方案比使用本机读取器的可比单行结果慢2-3倍.(63对27秒).
上面描述的所有解决方案的示例代码包括在内.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Create a test file
cd(tempdir);
fName = 'demo_file.txt';
fid = fopen(fName,'w');
for ixLoop = 1:5
d = randi(1e6, 1e5,2);
fprintf(fid, '%d, %d \n',d);
end
fclose(fid);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Initial code
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
nums = str2num(tline);
CHECK = round((CHECK + mean(nums) ) /2);
tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Initial code. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using sscanf, once per line
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
nums = sscanf(tline,'%d, %d');
CHECK = round((CHECK + mean(nums) ) /2);
tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Using sscanf, once per line. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using fscanf in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
while ~isempty(scannedData)
for ix = 1:size(scannedData,1)
nums = scannedData(ix,:);
CHECK = round((CHECK + mean(nums) ) /2);
end
scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
end
fclose(fid);
t = toc;
fprintf(1,'Using fscanf in large batches. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using textscan in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
while ~isempty(scannedData{1})
for ix = 1:size(scannedData{1},1)
nums = [scannedData{1}(ix) scannedData{2}(ix)];
CHECK = round((CHECK + mean(nums) ) /2);
end
scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
end
fclose(fid);
t = toc;
fprintf(1,'Using textscan in large batches. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, incrementing to end-of-line, sscanf
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
dataIncrement(end+1) = fread(fid,1,'uint8=>char'); %This can be slightly optimized
end
data = [dataBatch dataIncrement];
while ~isempty(data)
scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
for ix = 1:size(scannedData,1)
nums = scannedData(ix,:);
CHECK = round((CHECK + mean(nums) ) /2);
end
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
end
data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Reading large batches into memory, then sscanf. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java single line readers + sscanf
CHECK = 0;
tic;
bufferSize = 1e4;
reader = java.io.LineNumberReader(java.io.FileReader('demo_file.txt'),bufferSize );
tline = char(reader.readLine());
while ~isempty(tline)
nums = sscanf(tline,'%d, %d');
CHECK = round((CHECK + mean(nums) ) /2);
tline = char(reader.readLine());
end
reader.close();
t = toc;
fprintf(1,'Using java single line file reader and sscanf on single lines. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java scanner for file reading and string conversion
CHECK = 0;
tic;
jFile = java.io.File('demo_file.txt');
scanner = java.util.Scanner(jFile);
scanner.useDelimiter('[\s\,\n\r]+');
while scanner.hasNextInt()
nums = [scanner.nextInt() scanner.nextInt()];
CHECK = round((CHECK + mean(nums) ) /2);
end
scanner.close();
t = toc;
fprintf(1,'Using java single item token scanner. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, vectorized operations (non-compliant solution)
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
dataIncrement(end+1) = fread(fid,1,'uint8=>char'); %This can be slightly optimized
end
data = [dataBatch dataIncrement];
while ~isempty(data)
scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
CHECK = round((CHECK + mean(scannedData(:)) ) /2);
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
end
data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Fully batched operations. %3.2f sec. %d check \n', t, CHECK);
Run Code Online (Sandbox Code Playgroud)
(原始答案)
为了扩展Ben的观点......如果你逐行阅读这些文件,你的瓶颈将始终是文件I/O.
据我所知,有时你无法将整个文件放入内存中.我通常会读取大量字符(1e5,1e6或其左右,具体取决于系统的内存).然后我要么读取额外的单个字符(或者退回单个字符)以得到一个轮数,然后运行你的字符串解析(例如sscanf).
然后,如果您愿意,可以在重复该过程之前一次处理一行的结果大矩阵,直到您读取文件的结尾.
这有点乏味,但并不那么难.与单线阅读器相比,我通常看到速度提高90%以上.
(糟糕的想法使用Java批处理读取器删除羞耻)
| 归档时间: |
|
| 查看次数: |
31948 次 |
| 最近记录: |