I have these functions to process a 2GB text file. I split it into six parts for simultaneous processing, but it still takes more than 4 hours.
What else can I try to make the script faster?
Some details:
Sample data:
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
Script:
read2col()
{
    is_one_way=$(echo "$line" | awk -F'","' '{print $7}')
    price_outbound=$(echo "$line" | awk -F'","' '{print $30}')
    price_exc=$(echo "$line" | awk -F'","' '{print $25}')
    tax=$(echo "$line" | awk -F'","' '{print $27}')
    price_inc=$(echo "$line" | awk -F'","' '{print $26}')
}
#################################################
#for each line in the csv
mainf()
{
    cd $infarepath

    while read -r line; do
        #read the value of csv fields into variables
        read2col
        if [[ $is_one_way == 0 ]]; then
            if [[ $price_outbound > 0 ]]; then
                #calculate price inc and print the entire line to txt file
                echo $line | awk -v CONVFMT='%.2f' -v pout=$price_outbound -v tax=$tax -F'","' 'BEGIN {OFS = FS} {$25=pout;$26=(pout+(tax / 2)); print}' >>"$csvsplitfile".tmp
            else
                #divide price exc and inc by 2 if price outbound is not greater than 0
                echo $line | awk -v CONVFMT='%.2f' -v pexc=$price_exc -v pinc=$price_inc -F'","' 'BEGIN {OFS = FS} {$25=(pexc / 2);$26=(pinc /2); print}' >>"$csvsplitfile".tmp
            fi
        else
            echo $line >>"$csvsplitfile".tmp
        fi
    done < $csvsplitfile
}
paxdiablo 11
The first thing you should do is stop invoking six subshells running awk for every single line of input. Let's do some quick, back-of-the-envelope calculations.
Assuming your input lines are about 292 characters each (as per your sample), a 2G file will contain slightly over 7.3 million lines. That means you're starting and stopping up to 44 million processes.
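Those figures can be sanity-checked with a quick one-liner; the 2 GiB size, the ~292-byte line length, and the six-processes-per-line count are the assumptions stated above:

```shell
# Back-of-the-envelope: 2 GiB of ~292-byte lines, six processes per line.
awk 'BEGIN {
    lines = 2 * 1024^3 / 292
    printf "%.1f million lines, %.0f million processes\n", lines / 1e6, 6 * lines / 1e6
}'
# → 7.4 million lines, 44 million processes
```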
And, while Linux admirably handles fork and exec as efficiently as it can, it's not without cost:
pax$ time for i in {1..44000000} ; do true ; done
real 1m0.946s
On top of that, bash hasn't really been optimised for this sort of processing; its design leads to sub-optimal behaviour for this specific use case. For details on this, see this excellent answer over on our sister site.
The following shows an analysis of two methods of file processing: one program reading an entire file (each line having just hello on it), and bash reading it a line at a time. The two commands used to get the timings were:
time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )
For varying file sizes (user+sys time, averaged over a few runs), the results are quite interesting:
# of lines cat-method while-method
---------- ---------- ------------
1,000 0.375s 0.031s
10,000 0.391s 0.234s
100,000 0.406s 1.994s
1,000,000 0.391s 19.844s
10,000,000 0.375s 205.583s
44,000,000 0.453s 889.402s
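If you want to reproduce a row of that table yourself, a sketch like the following generates the test file and runs both timings; the somefile name and the hello content match the commands shown above, and the line count n is whichever row you want to check:

```shell
# Generate a test file of n "hello" lines, then time both methods on it.
n=10000                                   # pick any row's line count
yes hello | head -n "$n" > somefile

time ( cat somefile >/dev/null )
time ( while read -r x ; do echo $x >/dev/null ; done <somefile )

rm somefile
```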
From this, it appears the while method can hold its own for smaller data sets, but it really does not scale well.
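To put a number on that poor scaling, the per-line overhead implied by the last row of the table works out to roughly 20 microseconds (889.402 s over 44 million lines):

```shell
# Per-line cost of the while-method, from the 44,000,000-line row above.
awk 'BEGIN { printf "%.1f microseconds per line\n", 889.402 / 44000000 * 1e6 }'
# → 20.2 microseconds per line
```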
Since awk itself has ways to do calculations and formatted output, processing the file with a single awk script, rather than your bash/multi-awk-per-line combination, will make the cost of creating all those processes, and the line-based delays, go away.
This script would be a good first attempt; let's call it prog.awk:
BEGIN {
    FMT = "%.2f"
    OFS = FS
}
{
    isOneWay = $7
    priceOutbound = $30
    priceExc = $25
    tax = $27
    priceInc = $26

    if (isOneWay == 0) {
        if (priceOutbound > 0) {
            $25 = sprintf(FMT, priceOutbound)
            $26 = sprintf(FMT, priceOutbound + tax / 2)
        } else {
            $25 = sprintf(FMT, priceExc / 2)
            $26 = sprintf(FMT, priceInc / 2)
        }
    }
    print
}
You just run that single awk script with:
awk -F'","' -f prog.awk data.txt
Using your supplied test data, here's the before and after, with markers over field numbers 25 and 26:
                                                                                                                                                      <-25->   <-26->
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","146.00","222.26","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"
"111","2018-08-24","01:21","ZZ","AAA","BBB","0","","","ZZ","ZZ111","ZZ110","2018-10-12","07:00","2018-10-12","08:05","2018-10-19","06:30","2018-10-19","09:35","ZZZZ","ZZZZ","A","B","100.50","138.63","76.26","EEE","abc","100.50","45.50","0","E","ESSENTIAL","ESSENTIAL","4","4","7","125","125"