Sud*_*san 0 bash shell perl sed substr
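Below is a shell script written to process a huge file. It reads a fixed-length file line by line, extracts substrings, and appends the result to another file as a delimited file. It works correctly, but it is far too slow.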
array=() # Create array
while IFS='' read -r line || [[ -n "$line" ]] # Read a line
do
    coOrdinates="$(echo -e "${line}" | grep POSITION | cut -d'(' -f2 | cut -d')' -f1 | cut -d':' -f1,2)"
    if [[ -z "${coOrdinates// }" ]];
    then
        echo "Not adding"
    else
        array+=("$coOrdinates")
    fi
done < "$1_CTRL.txt"
while read -r line;
do
    result='"'
    for e in "${array[@]}"
    do
        SUBSTRING1=`echo "$e" | sed 's/.*://'`
        SUBSTRING=`echo "$e" | sed 's/:.*//'`
        result1=`perl -e "print substr('$line', $SUBSTRING,$SUBSTRING1)"`
        result1="$(echo -e "${result1}" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
        result=$result$result1'"'',''"'
    done
    echo $result >> $1_1.txt
done < "$1.txt"
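Earlier I used the cut command and then changed the script to the version above, but the run time did not improve at all. Could you suggest what changes would reduce the processing time?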
Update:
Sample content of the input file:
XLS01G702012 000034444132412342134
Control file:
OPTIONS (DIRECT=TRUE, ERRORS=1000, rows=500000) UNRECOVERABLE
load data
CHARACTERSET 'UTF8'
TRUNCATE
into table icm_rls_clientrel2_hg
trailing nullcols
(
APP_ID POSITION(1:3) "TRIM(:APP_ID)",
RELATIONSHIP_NO POSITION(4:21) "TRIM(:RELATIONSHIP_NO)"
)
Output file:
"LS0","1G702012 0000"
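Much of the run time in the script above likely comes from starting new processes: for every field of every input line it runs sed twice, perl once, and sed once more. As a rough sketch only, the same cut, trim, and quote step can be done inside a single awk process. The function name fixed_to_csv is hypothetical, and the code assumes, like the script in the question, that POSITION(m:n) means a 0-based offset m and a length n.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fixed_to_csv CTRL_FILE DATA_FILE
# Assumption: POSITION(m:n) is a 0-based offset m and a length n,
# which is how the script in the question uses the two numbers.
fixed_to_csv() {
    awk '
        # First file (control file): collect every POSITION(m:n) pair.
        NR == FNR {
            if (match($0, /POSITION\([0-9]+:[0-9]+\)/)) {
                # Skip the 9 characters of "POSITION(" and drop the ")".
                spec = substr($0, RSTART + 9, RLENGTH - 10)   # "m:n"
                split(spec, p, ":")
                n++
                off[n] = p[1]
                len[n] = p[2]
            }
            next
        }
        # Second file (data file): cut, trim, and quote each field.
        {
            out = ""
            for (i = 1; i <= n; i++) {
                f = substr($0, off[i] + 1, len[i])   # awk strings are 1-based
                gsub(/^[ \t]+|[ \t]+$/, "", f)       # trim both ends
                out = out (i > 1 ? "," : "") "\"" f "\""
            }
            print out
        }
    ' "$1" "$2"
}
```

With the file naming used in the question, a call could look like fixed_to_csv "${1}_CTRL.txt" "${1}.txt" > "${1}_1.txt". The output file is opened once instead of being reopened with >> for every line.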
perl:
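#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature 'say';    # needed for say()

# read the control file
my $ctrl;
{
    local $/;    # undef: slurp the whole file, blank lines included
    open my $fh, "<", shift @ARGV;
    $ctrl = <$fh>;
    close $fh;
}
# flat list of numbers: (m1, n1, m2, n2, ...), one pair per (m:n)
my @positions = ( $ctrl =~ /\((\d+):(\d+)\)/g );

# read the data file
open my $fh, "<", shift @ARGV;
while (<$fh>) {
    chomp;    # keep the newline out of the last field
    my @words;
    # m is used as a 0-based offset and n as a length
    for (my $i = 0; $i < scalar(@positions); $i += 2) {
        push @words, substr($_, $positions[$i], $positions[$i+1]);
    }
    say join ",", map {qq("$_")} @words;
}
close $fh;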
perl parse.pl x_CTRL.txt x.txt
"LS0","1G702012 00003"
This differs from the result you asked for:
In the control file's POSITION(m:n) syntax, is n a length or an index?
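To make the two readings concrete, here is a small bash sketch. The function names field_offset_length and field_start_end are hypothetical. The control file resembles an Oracle SQL*Loader control file, where POSITION(start:end) is documented as 1-based, inclusive start and end columns; whether the script is meant to follow that convention is an assumption, since the script in the question passes m and n straight to substr as an offset and a length.

```shell
#!/usr/bin/env bash
# Hypothetical helpers contrasting two readings of POSITION(m:n).

# Reading 1: m is a 0-based offset and n is a length,
# the same as Perl's substr($line, m, n).
field_offset_length() {
    local line=$1 m=$2 n=$3
    printf '%s\n' "${line:m:n}"
}

# Reading 2: m and n are 1-based, inclusive start and end columns.
field_start_end() {
    local line=$1 m=$2 n=$3
    printf '%s\n' "${line:m-1:n-m+1}"
}

line='ABCDEFGHIJ'
field_offset_length "$line" 1 3   # BCD
field_start_end     "$line" 1 3   # ABC
field_offset_length "$line" 4 6   # EFGHIJ
field_start_end     "$line" 4 6   # DEF
```

Both helpers use bash's built-in substring expansion, so neither starts an external process per field.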