Performance issue in Java compared to Perl

RBa*_*jee 4 java perl performance

I wrote a Perl script that processes a large number of CSV files and produces output. It takes 0.8326 seconds to complete.

my $opname = $ARGV[0];
my @files = `find . -name "*${opname}*.csv" -mtime -10 -type f`;
my %hash;
foreach my $file (@files) {
chomp $file;
my $time = $file;
$time =~ s/.*\~(.*?)\..*/$1/;

open(IN, $file) or print "Can't open $file\n";
while (<IN>) {
    my $line = $_;
    chomp $line;

    my $severity = (split(",", $line))[6];
    next if $severity =~ m/NORMAL/i;
    $hash{$time}{$severity}++;
}
close(IN);

}
foreach my $time (sort {$b <=> $a} keys %hash) {
    foreach my $severity ( keys %{$hash{$time}} ) {
        print $time . ',' . $severity . ',' . $hash{$time}{$severity} . "\n";
    }
}

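Now I have written the same logic in Java, but it takes 2600 ms (i.e. 2.6 s) to complete. My question is: why does Java take so long, and how can I get the same speed as Perl? Note: I am ignoring VM initialization and class loading time.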

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileFilter;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class MonitoringFileReader {
        static Map<String, Map<String,Integer>> store= new TreeMap<String, Map<String,Integer>>(); 
        static String opname;
        public static void testRead(String filepath) throws IOException
        {
            File file = new File(filepath);

            FileFilter fileFilter= new FileFilter() {

                @Override
                public boolean accept(File pathname) {
                    // TODO Auto-generated method stub
                    int timediffinhr=(int) ((System.currentTimeMillis()-pathname.lastModified())/86400000);
                    if(timediffinhr<10 && pathname.getName().endsWith(".csv")&& pathname.getName().contains(opname)){
                        return true;
                        }
                    else
                        return false;
                }
            };

            File[] listoffiles= file.listFiles(fileFilter);
        long time= System.currentTimeMillis();  
            for(File mf:listoffiles){
                String timestamp=mf.getName().split("~")[5].replace(".csv", "");
                BufferedReader br= new BufferedReader(new FileReader(mf),1024*500);
                String line;
                Map<String,Integer> tmp=store.containsKey(timestamp)?store.get(timestamp):new HashMap<String, Integer>();
                while((line=br.readLine())!=null)
                {
                    String severity=line.split(",")[6];
                    if(!severity.equals("NORMAL"))
                    {
                        tmp.put(severity, tmp.containsKey(severity)?tmp.get(severity)+1:1);
                    }
                }
            store.put(timestamp, tmp);
            }
        time=System.currentTimeMillis()-time;
            System.out.println(time+"ms");  
            System.out.println(store);


        }

        public static void main(String[] args) throws IOException
        {
            opname = args[0];
            long time= System.currentTimeMillis();
            testRead("./SMF/data/analyser/archive");
            time=System.currentTimeMillis()-time;
            System.out.println(time+"ms");
        }

    }

Input file name format: A~B~C~D~E~20150715080000.csv. There are about 500 files of roughly 1 MB each, with lines like:

A,B,C,D,E,F,CRITICAL,G
A,B,C,D,E,F,NORMAL,G
A,B,C,D,E,F,INFO,G
A,B,C,D,E,F,MEDIUM,G
A,B,C,D,E,F,CRITICAL,G

Java version: 1.7

//////////////////// UPDATE ///////////////////

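Following the comments below, I replaced split with a regular expression, and performance improved a lot. I now run the whole thing in a loop, and after 3-10 iterations the performance is quite acceptable.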

import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

    public class MonitoringFileReader {
        static Map<String, Map<String,Integer>> store= new HashMap<String, Map<String,Integer>>(); 
        static String opname="Etis_Egypt";
        static Pattern pattern1=Pattern.compile("(\\d+\\.)");
        static Pattern pattern2=Pattern.compile("(?:\"([^\"]*)\"|([^,]*))(?:[,])");
        static long currentsystime=System.currentTimeMillis();
        public static void testRead(String filepath) throws IOException
        {
            File file = new File(filepath);

            FileFilter fileFilter= new FileFilter() {

                @Override
                public boolean accept(File pathname) {
                    // TODO Auto-generated method stub
                    int timediffinhr=(int) ((currentsystime-pathname.lastModified())/86400000);
                    if(timediffinhr<10 && pathname.getName().endsWith(".csv")&& pathname.getName().contains(opname)){
                        return true;
                        }
                    else
                        return false;
                }
            };

            File[] listoffiles= file.listFiles(fileFilter);
        long time= System.currentTimeMillis();  
            for(File mf:listoffiles){
                Matcher matcher=pattern1.matcher(mf.getName());
                matcher.find();
                //String timestamp=mf.getName().split("~")[5].replace(".csv", "");
                String timestamp=matcher.group();
                BufferedReader br= new BufferedReader(new FileReader(mf));
                String line;
                Map<String,Integer> tmp=store.containsKey(timestamp)?store.get(timestamp):new HashMap<String, Integer>();
                while((line=br.readLine())!=null)
                {
                    matcher=pattern2.matcher(line);
                    matcher.find();matcher.find();matcher.find();matcher.find();matcher.find();matcher.find();matcher.find();
                    //String severity=line.split(",")[6];
                    String severity=matcher.group();
                    if(!severity.equals("NORMAL"))
                    {
                        tmp.put(severity, tmp.containsKey(severity)?tmp.get(severity)+1:1);
                    }
                }
                br.close();
            store.put(timestamp, tmp);
            }
        time=System.currentTimeMillis()-time;
            //System.out.println(time+"ms");    
            //System.out.println(store);


        }

        public static void main(String[] args) throws IOException
        {
            //opname = args[0];
            for(int i=0;i<20;i++){
            long time= System.currentTimeMillis();
            testRead("./SMF/data/analyser/archive");
            time=System.currentTimeMillis()-time;


            System.out.println("Time taken for "+i+" is "+time+"ms");
            }
        }

    }

But now I have another question.

Look at the results when running on a small data set:

Time taken for 0 is 218ms
Time taken for 1 is 134ms
Time taken for 2 is 127ms
Time taken for 3 is 98ms
Time taken for 4 is 90ms
Time taken for 5 is 77ms
Time taken for 6 is 71ms
Time taken for 7 is 72ms
Time taken for 8 is 62ms
Time taken for 9 is 57ms
Time taken for 10 is 53ms
Time taken for 11 is 58ms
Time taken for 12 is 59ms
Time taken for 13 is 46ms
Time taken for 14 is 44ms
Time taken for 15 is 45ms
Time taken for 16 is 53ms
Time taken for 17 is 45ms
Time taken for 18 is 61ms
Time taken for 19 is 42ms

The first few iterations take longer, and then the time drops. Why?

Thanks,

maa*_*nus 5

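Because of JIT compilation, a few seconds is not enough for Java to reach full speed. Java is optimized for servers that run for hours (or years), not for tiny utilities that take only a few seconds.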
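
The warm-up effect can be observed with a small timing harness like the one below. It is only a sketch: WarmupDemo, countSeverities, timeRounds and the synthetic input line are made up for illustration, and the actual numbers depend on the JVM and the machine.

```java
import java.util.HashMap;
import java.util.Map;

class WarmupDemo {
    // Hypothetical stand-in workload: tally the 7th field of synthetic CSV lines,
    // skipping NORMAL, the same way the question's loop does.
    static Map<String, Integer> countSeverities(int lines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        String[] severities = {"CRITICAL", "NORMAL", "INFO", "MEDIUM"};
        for (int i = 0; i < lines; i++) {
            String line = "A,B,C,D,E,F," + severities[i % severities.length] + ",G";
            String severity = line.split(",")[6];
            if (!severity.equals("NORMAL")) {
                Integer old = counts.get(severity);
                counts.put(severity, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    // Runs the same workload several times and records each round's wall time in ns.
    static long[] timeRounds(int rounds, int linesPerRound) {
        long[] nanos = new long[rounds];
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            countSeverities(linesPerRound);
            nanos[r] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) {
        long[] nanos = timeRounds(20, 200000);
        for (int r = 0; r < nanos.length; r++) {
            System.out.println("Round " + r + ": " + (nanos[r] / 1000000) + "ms");
        }
    }
}
```

On a typical JVM the early rounds tend to be the slowest, because the code is still being interpreted and compiled; later rounds run the compiled version.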

Concerning class loading, I guess you are not aware of e.g. Pattern and Matcher, which you use indirectly via split and which get loaded on demand.


static Map<String, Map<String,Integer>> store= new TreeMap<String, Map<String,Integer>>(); 

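A Perl hash is most similar to a Java HashMap, but you are using a TreeMap, which is slower. I guess it does not matter much here; just note that there are many more differences than you might think.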
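
To mirror what the Perl script does (count into a hash, sort the keys once when printing), something like the following could be used. This is a sketch, not a measured improvement: SortOnOutputDemo, count and timestampsNewestFirst are hypothetical names, and sorting the keys as strings assumes all timestamps have the same fixed width, as in 20150715080000.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SortOnOutputDemo {
    // Increment the counter for (timestamp, severity) in a nested HashMap.
    static void count(Map<String, Map<String, Integer>> store,
                      String timestamp, String severity) {
        Map<String, Integer> perTime = store.get(timestamp);
        if (perTime == null) {
            perTime = new HashMap<String, Integer>();
            store.put(timestamp, perTime);
        }
        Integer old = perTime.get(severity);
        perTime.put(severity, old == null ? 1 : old + 1);
    }

    // Sort the timestamps once, newest first, similar to
    // `sort {$b <=> $a} keys %hash` in the Perl version.
    // Assumption: fixed-width numeric strings, so string order equals numeric order.
    static List<String> timestampsNewestFirst(Map<String, Map<String, Integer>> store) {
        List<String> keys = new ArrayList<String>(store.keySet());
        Collections.sort(keys, Collections.reverseOrder());
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> store =
                new HashMap<String, Map<String, Integer>>();
        count(store, "20150715080000", "CRITICAL");
        count(store, "20150716080000", "INFO");
        count(store, "20150715080000", "CRITICAL");
        for (String timestamp : timestampsNewestFirst(store)) {
            for (Map.Entry<String, Integer> e : store.get(timestamp).entrySet()) {
                System.out.println(timestamp + "," + e.getKey() + "," + e.getValue());
            }
        }
    }
}
```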


 int timediffinhr=(int) ((System.currentTimeMillis()-pathname.lastModified())/86400000);

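You are reading the time again and again for each file. You are doing it even for files whose name does not end with ".csv". That is surely not what find does.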
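
One way to avoid that is to compute the cutoff once and to run the cheap name checks before touching the file system. The filter below is a sketch under those assumptions; RecentCsvFilter and its constructor parameters are hypothetical names, and it accepts the same files as the original condition (age of less than 10 whole days).

```java
import java.io.File;
import java.io.FileFilter;

class RecentCsvFilter implements FileFilter {
    private final String opname;
    private final long cutoffMillis; // computed once, not once per file

    RecentCsvFilter(String opname, int maxAgeDays, long nowMillis) {
        this.opname = opname;
        this.cutoffMillis = nowMillis - maxAgeDays * 86400000L;
    }

    @Override
    public boolean accept(File f) {
        String name = f.getName();
        // Cheap in-memory string checks first; they need no file system access.
        if (!name.endsWith(".csv") || !name.contains(opname)) {
            return false;
        }
        // Only now ask the file system for the modification time.
        return f.lastModified() > cutoffMillis;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        String opname = args.length > 1 ? args[1] : "";
        File[] files = dir.listFiles(
                new RecentCsvFilter(opname, 10, System.currentTimeMillis()));
        if (files != null) {
            for (File f : files) {
                System.out.println(f.getName());
            }
        }
    }
}
```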


String timestamp=mf.getName().split("~")[5].replace(".csv", "");

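Unlike Perl, Java does not cache regular expressions. As far as I know, splitting on a single character is optimized specially, but apart from that you would be much better off using something like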

private static final Pattern FILENAME_PATTERN =
    Pattern.compile("(?:[^~]*~){5}([^~]*)\\.csv");

Matcher m = FILENAME_PATTERN.matcher(mf.getName());
if (!m.matches()) continue; // skip (or otherwise handle) names that do not fit the format
String timestamp = m.group(1);

BufferedReader br = new BufferedReader(new FileReader(mf), 1024*500);

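This might be the culprit. By default, FileReader uses the platform encoding, which may be UTF-8. Decoding that is usually slower than ASCII or LATIN-1. As far as I know, Perl works directly with bytes unless told otherwise.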

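A buffer size of half a megabyte is insanely large for anything that takes just a few seconds, especially when you allocate it multiple times. Note that there is nothing like this in your Perl code.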
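
A reader with an explicit single-byte charset and the default buffer size might look like the sketch below. It assumes the CSV files contain only ASCII/LATIN-1 data, which the question does not state; Latin1ReaderDemo and countSeverities are hypothetical names, and whether this is measurably faster would have to be tested on the real files.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class Latin1ReaderDemo {
    // Counts the 7th comma-separated field of every line, skipping NORMAL.
    static Map<String, Integer> countSeverities(File csv) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        // Explicit charset (assumption: the data is ASCII/LATIN-1) and the
        // default BufferedReader buffer instead of 1024*500 chars.
        // try-with-resources (Java 7+) closes the reader even on exceptions.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream(csv), StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = br.readLine()) != null) {
                String severity = line.split(",")[6];
                if (!severity.equals("NORMAL")) {
                    Integer old = counts.get(severity);
                    counts.put(severity, old == null ? 1 : old + 1);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java Latin1ReaderDemo <file.csv>");
            return;
        }
        System.out.println(countSeverities(new File(args[0])));
    }
}
```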


Summed up, Perl and find may indeed be faster for such a short task.