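Recently I had to search a number of string values to see which ones match a certain pattern. Neither the number of string values nor the pattern itself is known until the user enters a search term. The problem is that each time my application runs the following line: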
if (stringValue.matches (rexExPattern))
{
    // do something so simple
}
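it takes around 40 microseconds. Needless to say, once the number of string values exceeds a few thousand, this becomes too slow.
The pattern is something like: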
"A*B*C*D*E*F*"
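where A~F are just examples here, but the pattern has that general shape. Please note that the pattern itself may change with every search. For example, "A*B*C*" may change to "W*D*G*A*".
I wonder whether there is a better substitute for the above pattern or, more generally, an alternative to Java regular expressions.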
See*_*ose 94
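Regular expressions in Java are compiled into an internal data structure. This compilation is the time-consuming part. Every time you call String.matches(String regex), the specified regular expression is compiled again.
So you should compile your regular expression only once and reuse it: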
Pattern pattern = Pattern.compile(regexPattern);
for (String value : values) {
    Matcher matcher = pattern.matcher(value);
    if (matcher.matches()) {
        // your code here
    }
}
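Since the question says the pattern changes per search, one way to keep the "compile once" rule is to cache compiled patterns by their regex string. The following is only a minimal sketch under that assumption; the class PatternCache and its methods get and countMatches are hypothetical names, not part of the JDK or of the answer above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical helper: caches compiled patterns keyed by their regex string.
class PatternCache {
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Compiles the regex on first use and returns the cached Pattern afterwards.
    static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    // Counts the values that match the regex in full, compiling it at most once.
    static long countMatches(List<String> values, String regex) {
        Pattern pattern = get(regex);
        long count = 0;
        for (String value : values) {
            if (pattern.matcher(value).matches()) {
                ++count;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("AABBC", "ABD", "", "CCC");
        System.out.println(countMatches(values, "A*B*C*"));
    }
}
```

The cache here is unbounded, so if users can enter arbitrarily many distinct patterns it would probably need an eviction policy. For a single search over many values, compiling once before the loop, as shown above, is already enough.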
Jas*_*n C 29
Consider the following (quick and dirty) test:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test3 {

    // time that tick() was called
    static long tickTime;

    // called at start of operation, for timing
    static void tick () {
        tickTime = System.nanoTime();
    }

    // called at end of operation, prints message and time since tick().
    static void tock (String action) {
        long mstime = (System.nanoTime() - tickTime) / 1000000;
        System.out.println(action + ": " + mstime + "ms");
    }

    // generate random strings of form AAAABBBCCCCC; a random
    // number of characters each randomly repeated.
    static List<String> generateData (int itemCount) {
        Random random = new Random();
        List<String> items = new ArrayList<String>();
        long mean = 0;
        for (int n = 0; n < itemCount; ++ n) {
            StringBuilder s = new StringBuilder();
            int characters = random.nextInt(7) + 1;
            for (int k = 0; k < characters; ++ k) {
                char c = (char)(random.nextInt('Z' - 'A') + 'A');
                int rep = random.nextInt(95) + 5;
                for (int j = 0; j < rep; ++ j)
                    s.append(c);
                mean += rep;
            }
            items.add(s.toString());
        }
        mean /= itemCount;
        System.out.println("generated data, average length: " + mean);
        return items;
    }

    // match all strings in items to regexStr, do not precompile.
    static void regexTestUncompiled (List<String> items, String regexStr) {
        tick();
        int matched = 0, unmatched = 0;
        for (String item:items) {
            if (item.matches(regexStr))
                ++ matched;
            else
                ++ unmatched;
        }
        tock("uncompiled: regex=" + regexStr + " matched=" + matched +
             " unmatched=" + unmatched);
    }

    // match all strings in items to regexStr, precompile.
    static void regexTestCompiled (List<String> items, String regexStr) {
        tick();
        Matcher matcher = Pattern.compile(regexStr).matcher("");
        int matched = 0, unmatched = 0;
        for (String item:items) {
            if (matcher.reset(item).matches())
                ++ matched;
            else
                ++ unmatched;
        }
        tock("compiled: regex=" + regexStr + " matched=" + matched +
             " unmatched=" + unmatched);
    }

    // test all strings in items against regexStr.
    static void regexTest (List<String> items, String regexStr) {
        regexTestUncompiled(items, regexStr);
        regexTestCompiled(items, regexStr);
    }

    // generate data and run some basic tests
    public static void main (String[] args) {
        List<String> items = generateData(1000000);
        regexTest(items, "A*");
        regexTest(items, "A*B*C*");
        regexTest(items, "E*C*W*F*");
    }

}
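The strings are random sequences of 1 to 7 characters (nextInt(7) + 1), each character occurring 5 to 99 consecutive times (nextInt(95) + 5), e.g. "AAAAAAGGGGGDDFFFFFF". I guessed at this based on your expressions.
Granted, this might not be representative of your data set, but on my modest 2.3 GHz dual-core i5 the timings for applying those regular expressions to 1 million randomly generated strings of average length 208 were: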
Regex      Uncompiled   Precompiled
A*         0.564 sec    0.126 sec
A*B*C*     1.768 sec    0.238 sec
E*C*W*F*   0.795 sec    0.275 sec
Actual output:
generated data, average length: 208
uncompiled: regex=A* matched=6004 unmatched=993996: 564ms
compiled: regex=A* matched=6004 unmatched=993996: 126ms
uncompiled: regex=A*B*C* matched=18677 unmatched=981323: 1768ms
compiled: regex=A*B*C* matched=18677 unmatched=981323: 238ms
uncompiled: regex=E*C*W*F* matched=25495 unmatched=974505: 795ms
compiled: regex=E*C*W*F* matched=25495 unmatched=974505: 275ms
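Even without the speedup of a precompiled expression, and even considering that the results vary wildly depending on the data set and regular expression, and even considering that I broke a basic rule of proper Java performance tests and forgot to prime HotSpot first, this is very fast, and I still wonder whether the bottleneck is truly where you think it is.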
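On the HotSpot caveat: a rough way to reduce JIT warm-up effects is to run the measured code a few times untimed before taking the measurement. The sketch below assumes this simple approach; the class WarmupTiming, the method countMatches, the data, and the number of warm-up passes are all made up for illustration, and a dedicated harness such as JMH would be more rigorous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical timing sketch with untimed warm-up passes before the measured pass.
class WarmupTiming {

    // Counts full matches, reusing one Matcher via reset() as in the test above.
    static int countMatches(List<String> items, Pattern pattern) {
        Matcher matcher = pattern.matcher("");
        int matched = 0;
        for (String item : items) {
            if (matcher.reset(item).matches()) {
                ++matched;
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Illustrative data only: half the items match A*B*C*, half do not.
        List<String> items = new ArrayList<>();
        for (int i = 0; i < 100000; ++i) {
            items.add(i % 2 == 0 ? "AAAABBBB" : "AAAAXBBB");
        }
        Pattern pattern = Pattern.compile("A*B*C*");

        // Untimed warm-up passes give the JIT a chance to compile the hot loop.
        int sink = 0;
        for (int i = 0; i < 20; ++i) {
            sink += countMatches(items, pattern);
        }

        // Timed pass.
        long start = System.nanoTime();
        int matched = countMatches(items, pattern);
        long elapsedMs = (System.nanoTime() - start) / 1000000;
        System.out.println("matched=" + matched + " in " + elapsedMs
                + "ms (warm-up sink=" + sink + ")");
    }
}
```

Printing the sink value keeps the warm-up results in use, so the JIT is less likely to discard that work as dead code.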
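After switching to precompiled expressions, if you are still not meeting your actual performance requirements, do some profiling. If you find the bottleneck is still in the search, consider implementing a more optimized search algorithm.
For example, assuming your data set is like my test set above: if your data set is known ahead of time, reduce each item in it to a smaller string key by removing repeated characters, e.g. for "AAAAAAABBBBCCCCCCC", store it in a map of some sort keyed by "ABC". When a user searches for "A*B*C*" (presuming your regex is in that particular form), look for "ABC" items. Or whatever. It is highly dependent on your scenario.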
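The key-reduction idea could be sketched roughly as follows. This assumes search terms consist only of letters each followed by "*", and it follows the simplification above by looking up the exact key only. The class RunKeyIndex and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index: groups items by their run-collapsed key, e.g. "AAABBC" -> "ABC".
class RunKeyIndex {
    private final Map<String, List<String>> itemsByKey = new HashMap<>();

    // Collapses each run of a repeated character into a single character.
    static String collapseRuns(String s) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < s.length(); ++i) {
            char c = s.charAt(i);
            if (i == 0 || c != s.charAt(i - 1)) {
                key.append(c);
            }
        }
        return key.toString();
    }

    // Stores the item under its collapsed key.
    void add(String item) {
        itemsByKey.computeIfAbsent(collapseRuns(item), k -> new ArrayList<>()).add(item);
    }

    // Assumes a search term of the form "A*B*C*": strips the '*' characters
    // and returns the items stored under exactly that key.
    List<String> search(String searchTerm) {
        String key = searchTerm.replace("*", "");
        return itemsByKey.getOrDefault(key, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        RunKeyIndex index = new RunKeyIndex();
        index.add("AAAAAAABBBBCCCCCCC");
        index.add("AABBB");
        index.add("ABC");
        System.out.println(index.search("A*B*C*"));
    }
}
```

Note that this is not equivalent to the regex: "A*B*C*" also matches strings that skip letters, such as "AABBB" or the empty string, whereas the exact-key lookup only returns items in which every letter occurs at least once. Covering those cases would mean also looking up the shorter keys (for example "AB", "AC", "BC"), which may or may not pay off depending on the data.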