Performance problem: looping over many XML files

Question

ssb*_*sts 1 xml bash performance awk loops

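I have several years' worth of daily XML reports. I am trying to go through each one, find the purchase date, and determine whether it is at least one year earlier than the date of the file. If it is, I write the file name and the purchase date to a log. The problem is that performance is really terrible.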

#!/bin/bash

for file in *xml ; do
fileDate=`echo ${file} | cut -c 18-35 | sed 's/.xml//'`
fileDateSeconds=`date --date="${fileDate}" +"%s"`
awk '/PurchaseDate/ {print}' ${file} >> /tmp/yamExport/tempFile.txt
cat /tmp/yamExport/tempFile.txt | while read input
do
        lineDate=`echo ${input} | cut -c 15-24`
        lineDateSeconds=`date --date="${lineDate}" +"%s"`
        delta=`expr $fileDateSeconds - $lineDateSeconds`
        if [ "$delta" -gt "31556926" ]
        then
        #echo "$file : $input"
        echo "$file : $input" >> /tmp/yamExport/yamExportTimestamps2.log
        fi
done
done


At first I simply looped through each whole file line by line:

cat ${file} | while read input
do
        if [[ "$input" =~ "PurchaseDate" ]]
        then


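But then I found that using awk to quickly grab all the lines containing PurchaseDate, writing them to a temporary file, and looping over that file was faster (but still slow). If anyone has suggestions for improving performance, that would be very helpful. Can I operate on the output of the awk statement in something like a loop? If I could do that, I think performance would be better.

Thanks for any tips.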

Answer 1

gle*_*man 5

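Writing the awk output to a temporary file is certainly hurting your performance. Also, you are appending to that temporary file, so for every subsequent XML file you process the results of the earlier XML files all over again.

This code minimizes the number of external processes you need to call: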

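for file in *xml ; do
    # Date embedded in the file name (characters 18-35), minus any ".xml" suffix
    fileDate=${file:17:18}
    fileDateSeconds=$(date --date="${fileDate%.xml}" +"%s")
    # grep streams the matching lines straight into the loop; no temporary file
    grep -F 'PurchaseDate' "$file" |
    while read -r input; do
        # Characters 15-24 of the trimmed line hold the purchase date
        lineDateSeconds=$(date --date="${input:14:10}" +"%s")
        if (( (fileDateSeconds - lineDateSeconds) > 31556926 )); then
            echo "$file : $input"
        fi
    done
done > /tmp/yamExport/yamExportTimestamps2.log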
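
The accumulation problem with the appended temporary file can be reproduced in isolation. The following is only a sketch: the two sample files, their contents, and the names `workdir`, `appended.log` and `piped.log` are made up for illustration and are not taken from the question.

```shell
#!/bin/bash
# Sketch of the accumulation problem; sample files and contents are made up.
workdir=$(mktemp -d)
printf '<PurchaseDate>2010-01-01</PurchaseDate>\n' > "$workdir/a.xml"
printf '<PurchaseDate>2011-02-02</PurchaseDate>\n' > "$workdir/b.xml"

# Appending to one temporary file: the pass for b.xml re-reads a.xml's line too.
for file in "$workdir"/*.xml; do
    awk '/PurchaseDate/' "$file" >> "$workdir/tempFile.txt"
    while read -r input; do
        echo "${file##*/} : $input"
    done < "$workdir/tempFile.txt"
done > "$workdir/appended.log"

# Piping: each pass sees only the lines of the current file.
for file in "$workdir"/*.xml; do
    awk '/PurchaseDate/' "$file" |
    while read -r input; do
        echo "${file##*/} : $input"
    done
done > "$workdir/piped.log"

appended_lines=$(grep -c '' "$workdir/appended.log")
piped_lines=$(grep -c '' "$workdir/piped.log")
echo "appended: $appended_lines lines, piped: $piped_lines lines"
```

With only two input files the appended variant already emits an extra line; over years of daily reports the temporary file keeps growing, so every later file costs more than the one before it.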


I switched from awk to grep, which is the more appropriate tool for finding lines in a file.
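
As a small illustration of what the `-F` option does, the sketch below compares a fixed-string search with a regular-expression search. The two sample lines and the variable names are made up for the example.

```shell
#!/bin/bash
# Sketch: grep -F treats the pattern as a literal string, not as a regex.
# The sample input lines ("a.c" and "abc") are made up for illustration.
fixed_count=$(printf 'a.c\nabc\n' | grep -cF 'a.c')   # literal "a.c" only
regex_count=$(printf 'a.c\nabc\n' | grep -c 'a.c')    # "." matches any character

echo "fixed-string matches: $fixed_count"
echo "regex matches: $regex_count"
```

For a plain word such as PurchaseDate both forms select the same lines; `-F` simply tells grep that no pattern interpretation is needed.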

Moving the redirection to the output file outside the outer loop should reduce the number of times the file has to be opened.
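
The difference between redirecting inside and outside a loop can be sketched as follows. The log files here are temporary files created only for the illustration, and `log_each` and `log_once` are hypothetical names.

```shell
#!/bin/bash
# Sketch: both loops write the same three lines, but the first reopens the
# log on every iteration while the second opens it once for the whole loop.
# The log paths are temporary files created only for this illustration.
log_each=$(mktemp)
log_once=$(mktemp)

for n in 1 2 3; do
    echo "line $n" >> "$log_each"   # open, append, close on each iteration
done

for n in 1 2 3; do
    echo "line $n"
done > "$log_once"                  # opened once for the entire loop

cmp -s "$log_each" "$log_once" && echo "identical output"
```

One behavioral difference to keep in mind: `>` truncates the log at the start of each run, whereas the `>>` in the question kept adding to whatever was already there. Using `done >> file` would keep the append behavior while still opening the file only once.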

Archived: 12 years, 1 month ago
Views: 69
Last activity: 12 years, 1 month ago