Joh*_*nGa 37 java performance stringtokenizer
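In my software I need to split strings into words. I currently have more than 19,000,000 documents, each with more than 30 words.
Which of the following two ways is the best way to do this (in terms of performance)?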
StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    String word = sTokenize.nextToken(); // process each word here
}
or
String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    String word = splitS[i]; // process each word here
}
Pet*_*rey 63
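If your data is already in a database and you need to parse the strings into words, I would suggest calling indexOf repeatedly. It is several times faster than either of your solutions (roughly 2 to 3 times in the benchmark below).
However, getting the data out of the database is still likely to be much more expensive.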
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

public class SplitBenchmark {
    public static void main(String[] args) {
        // Build a sample of 60 six-digit "words", each followed by a space.
        StringBuilder sb = new StringBuilder();
        for (int i = 100000; i < 100000 + 60; i++)
            sb.append(i).append(' ');
        String sample = sb.toString();

        int runs = 100000;
        // Repeat the whole comparison five times so later rounds run on warmed-up code.
        for (int i = 0; i < 5; i++) {
            {
                long start = System.nanoTime();
                for (int r = 0; r < runs; r++) {
                    StringTokenizer st = new StringTokenizer(sample);
                    List<String> list = new ArrayList<String>();
                    while (st.hasMoreTokens())
                        list.add(st.nextToken());
                }
                long time = System.nanoTime() - start;
                System.out.printf("StringTokenizer took an average of %.1f us%n", time / (runs * 1000.0));
            }
            {
                long start = System.nanoTime();
                // Compile the pattern once, outside the timed loop.
                Pattern spacePattern = Pattern.compile(" ");
                for (int r = 0; r < runs; r++) {
                    List<String> list = Arrays.asList(spacePattern.split(sample, 0));
                }
                long time = System.nanoTime() - start;
                System.out.printf("Pattern.split took an average of %.1f us%n", time / (runs * 1000.0));
            }
            {
                long start = System.nanoTime();
                for (int r = 0; r < runs; r++) {
                    List<String> list = new ArrayList<String>();
                    int pos = 0, end;
                    while ((end = sample.indexOf(' ', pos)) >= 0) {
                        list.add(sample.substring(pos, end));
                        pos = end + 1;
                    }
                    // Keep the last token when the input does not end with a space.
                    if (pos < sample.length())
                        list.add(sample.substring(pos));
                }
                long time = System.nanoTime() - start;
                System.out.printf("indexOf loop took an average of %.1f us%n", time / (runs * 1000.0));
            }
        }
    }
}
prints
StringTokenizer took an average of 5.8 us
Pattern.split took an average of 4.8 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 4.9 us
Pattern.split took an average of 3.7 us
indexOf loop took an average of 1.7 us
StringTokenizer took an average of 5.2 us
Pattern.split took an average of 3.9 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 5.1 us
Pattern.split took an average of 4.1 us
indexOf loop took an average of 1.6 us
StringTokenizer took an average of 5.0 us
Pattern.split took an average of 3.8 us
indexOf loop took an average of 1.6 us
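For use outside a benchmark, the indexOf loop above could be wrapped in a small helper. This is only a sketch: the class name IndexOfSplitter and the choice to skip empty tokens (so that repeated spaces behave as they do with StringTokenizer) are assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexOfSplitter {
    /** Splits s on a single delimiter character, skipping empty tokens. */
    static List<String> split(String s, char delim) {
        List<String> words = new ArrayList<>();
        int pos = 0, end;
        while ((end = s.indexOf(delim, pos)) >= 0) {
            if (end > pos)                      // ignore runs of delimiters
                words.add(s.substring(pos, end));
            pos = end + 1;
        }
        if (pos < s.length())                   // last token, no trailing delimiter
            words.add(s.substring(pos));
        return words;
    }

    public static void main(String[] args) {
        // prints [the, quick, brown, fox]
        System.out.println(split("the quick  brown fox", ' '));
    }
}
```

Unlike the benchmarked loop, this version drops empty tokens; remove the `end > pos` check if empty fields matter for your data.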
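Opening a file costs about 8 ms. Because the files are so small, your cache may improve performance by a factor of 2 to 5. Even so, 19 million files at 8 ms each is about 42 hours uncached, so you are going to spend roughly 10 hours just opening files. The cost of split versus StringTokenizer is far less than 0.01 ms per string. Parsing 19 million documents x 30 words x 8 letters per word should take about 10 seconds (at about 1 GB per 2 seconds).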
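The arithmetic behind these estimates can be checked with a few lines. This is a sketch using only the figures quoted in the answer; the class name CostEstimate is hypothetical, and counting one byte per letter is an assumption.

```java
final class CostEstimate {
    static final long DOCUMENTS = 19_000_000L;
    static final int WORDS_PER_DOC = 30;
    static final int BYTES_PER_WORD = 8;                 // assumes one byte per letter
    static final double OPEN_MS = 8.0;                   // cost of opening one file
    static final double PARSE_BYTES_PER_SEC = 1e9 / 2;   // about 1 GB per 2 seconds

    /** Hours spent opening every file, given a cache speed-up factor. */
    static double openHours(double cacheSpeedup) {
        return DOCUMENTS * OPEN_MS / cacheSpeedup / 1000.0 / 3600.0;
    }

    /** Seconds spent parsing all the text at the quoted throughput. */
    static double parseSeconds() {
        return DOCUMENTS * WORDS_PER_DOC * BYTES_PER_WORD / PARSE_BYTES_PER_SEC;
    }

    public static void main(String[] args) {
        System.out.printf("open, uncached:  %.1f h%n", openHours(1));
        System.out.printf("open, 2x cache:  %.1f h%n", openHours(2));
        System.out.printf("open, 5x cache:  %.1f h%n", openHours(5));
        System.out.printf("parse all text:  %.1f s%n", parseSeconds());
    }
}
```

With these inputs, opening the files comes to about 42 hours uncached and between roughly 8 and 21 hours with a 2 to 5 times cache speed-up, while parsing takes about 9 seconds, which is why the choice of tokenizer hardly matters here.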
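If you want to improve performance, I suggest you have far fewer files, for example by using a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/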
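The Java API specification recommends using split. See the documentation for StringTokenizer, which describes it as a legacy class whose use is discouraged in new code.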
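A minimal sketch of the split-based route follows, assuming words are separated by arbitrary whitespace. The class name SplitExample is hypothetical. The pattern is compiled once and reused, as in the Pattern.split benchmark above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

final class SplitExample {
    // Compiled once; reusing it avoids recompiling the regex for every document.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the whitespace-separated words of s. */
    static List<String> words(String s) {
        // Trim first: leading whitespace would otherwise yield an empty first element.
        String trimmed = s.trim();
        if (trimmed.isEmpty())
            return Collections.emptyList();
        return Arrays.asList(WHITESPACE.split(trimmed));
    }

    public static void main(String[] args) {
        // prints [split, these, words]
        System.out.println(words("  split these   words "));
    }
}
```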
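Another important point, undocumented as far as I can tell, is that asking StringTokenizer to return the delimiters along with the tokens (by using the constructor StringTokenizer(String str, String delim, boolean returnDelims)) also reduces processing time. So if you are looking for performance, I would recommend something like this: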
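import java.util.StringTokenizer;

public class DelimSplitter {
    private static final String DELIM = "#";

    // Prints every field of input on its own line; an empty field prints as "null".
    public void splitIt(String input) {
        StringTokenizer st = new StringTokenizer(input, DELIM, true);
        while (st.hasMoreTokens()) {
            String next = getNext(st);
            System.out.println(next);
        }
    }

    // Returns the next field, or null when the field is empty.
    private String getNext(StringTokenizer st) {
        String value = st.nextToken();
        if (DELIM.equals(value))
            value = null;       // a delimiter with nothing before it: empty field
        else if (st.hasMoreTokens())
            st.nextToken();     // discard the delimiter that follows this field
        return value;
    }
}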
Despite the overhead of the getNext() method, which discards the delimiters for you, it is still 50% faster according to my benchmarks.
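Apart from speed, which the sketch below does not measure, the returnDelims approach also changes what you get back for empty fields: the default StringTokenizer silently skips them, while the getNext() logic reports them as null. The class name ReturnDelimsDemo and its method names are hypothetical, used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class ReturnDelimsDemo {
    private static final String DELIM = "#";

    /** Default tokenizer: consecutive delimiters are collapsed. */
    static List<String> plain(String input) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, DELIM);
        while (st.hasMoreTokens())
            fields.add(st.nextToken());
        return fields;
    }

    /** returnDelims = true, with the same logic as getNext() above. */
    static List<String> withDelims(String input) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, DELIM, true);
        while (st.hasMoreTokens()) {
            String value = st.nextToken();
            if (DELIM.equals(value)) {
                fields.add(null);           // empty field
            } else {
                fields.add(value);
                if (st.hasMoreTokens())
                    st.nextToken();         // discard the following delimiter
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(plain("a##b"));      // prints [a, b]
        System.out.println(withDelims("a##b")); // prints [a, null, b]
    }
}
```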