Java 中长字符串的正则表达式模式匹配性能

Question

Java 中长字符串的正则表达式模式匹配性能

我有一个正则表达式，在找到匹配项时效果很好（500 纳秒），但在没有匹配项时需要很长时间（超过 3 秒）。我怀疑这可能是因为回溯。我尝试了一些选项，例如根据某些文档进行转换.*，(.*)?但没有帮助。

输入：一个非常长的字符串 - 在某些情况下为 5k 个字符。

正则表达式匹配：.*substring1.*substring2.*

我正在预编译模式并重新使用匹配器，我还能尝试什么？

这是我的代码片段 - 我将使用数百万个不同的输入字符串调用此方法，但只有少数正则表达式模式。

private static HashMap<String, Pattern> patternMap = new HashMap<String, Pattern>();
private static HashMap<String, Matcher> matcherMap = new HashMap<String, Matcher>();

Run Code Online (Sandbox Code Playgroud)

这是我的方法：

public static Boolean regex_match(String line, String regex) {
    if (regex == null || line == null) {
      return null;
    }
    if (!patternMap.containsKey(regex)) {
      patternMap.put(regex, Pattern.compile(regex));
      matcherMap.put(regex,patternMap.get(regex).matcher(""));
    }
    return matcherMap.get(regex).reset(line).find(0);
 }

Run Code Online (Sandbox Code Playgroud)

Answer 1

Str*_*ids 5

正如您所暗示的那样，您的正则表达式会遇到称为灾难性回溯的问题。本质上，第一个.*将匹配整个字符串，然后回溯直到substring1匹配。这将与重复substring2。因为substring2失败了，第二个.*将需要找到另一个substring2开始匹配的地方，然后它会再次失败。每次substring1匹配时，我们都需要检查每个substring2可能匹配的地方。

您已经在使用pattern.find()，因此可以省略开头和结尾.*。然后，将内部更改.*为 a.*?可以通过将贪婪匹配器变为惰性匹配器来提高性能。

这会产生：substring1.*?substring2

归档时间：	9 年，10 月前
查看次数：	4903 次
最近记录：	9 年，10 月前