搜索字符串中未知模式的最有效方法是什么?

Ale*_*iam 17 java algorithm substring

我试图找到以下模式:

  • 发生不止一次
  • 长度超过1个字符
  • 不是任何其他已知模式的子串

不知道可能发生的任何模式.

例如:

  • 字符串"男孩倒在钟楼"将返回'ell', 'the b', 'y '.
  • 字符串"男孩倒在钟楼上,男孩倒在钟楼旁边"会回来'the boy fell by the bell'.

使用双for循环,它可以非常低效地强制使用:

ArrayList<String> patternsList = new ArrayList<>();
int length = string.length();
for (int i = 0; i < length; i++) {
    int limit = (length - i) / 2;
    for (int j = limit; j >= 1; j--) {
        int candidateEndIndex = i + j;
        String candidate = string.substring(i, candidateEndIndex);

        if(candidate.length() <= 1) {
            continue;
        }

        if (string.substring(candidateEndIndex).contains(candidate)) {
            boolean notASubpattern = true;
            for (String pattern : patternsList) {
                if (pattern.contains(candidate)) {
                    notASubpattern = false;
                    break;
                }
            }

            if (notASubpattern) {
                patternsList.add(candidate);
            }
        }
    }
}
Run Code Online (Sandbox Code Playgroud)

然而,当搜索大量模式的大字符串时,这是非常慢的.

Mat*_*ans 24

您可以在线性时间内为字符串构建后缀树:https: //en.wikipedia.org/wiki/Suffix_tree

您要查找的模式是与仅具有叶子元素的内部节点对应的字符串.


J. *_*rth 8

您可以使用n-gram来查找字符串中的模式.扫描字符串n-gram需要O(n)时间.当您使用n-gram找到子字符串时,将其放入哈希表中,并计算字符串中找到子字符串的次数.当您在字符串中搜索n-gram时,在哈希表中搜索大于1的计数,以查找字符串中的重复模式.

例如,在字符串"男孩摔倒钟,男孩摔倒钟"使用6克将发现子串"男孩摔倒了钟".具有该子字符串的哈希表条目的计数为2,因为它在字符串中出现了两次.改变n-gram中的单词数将帮助您发现字符串中的不同模式.

Dictionary<string, int>dict = new Dictionary<string, int>();
int count = 0;
int ngramcount = 6;
string substring = "";

// Add entries to the hash table
while (count < str.length) {
    // copy the words into the substring
    int i = 0;
    substring = "";
    while (ngramcount > 0 && count < str.length) {
        substring[i] = str[count];
        if (str[i] == ' ')
            ngramcount--;
        i++;
        count++;
    }
    ngramcount = 6;
    substring.Trim();  // get rid of the last blank in the substring
    // Update the dictionary (hash table) with the substring
    if (dict.Contains(substring)) {  // substring is already in hash table so increment the count
        int hashCount = dict[substring];
        hashCount++;
        dict[substring] = hashCount;
    }
    else
        dict[substring] = 1;
}

// Find the most commonly occurrring pattern in the string
// by searching the hash table for the greatest count.
int maxCount = 0;
string mostCommonPattern = "";
foreach (KeyValuePair<string, int> pair in dict) {
    if (pair.Value > maxCount) {
        maxCount = pair.Value;
        mostCommonPattern = pair.Key;
    }
}
Run Code Online (Sandbox Code Playgroud)