Find all common substrings of two given strings

Question

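I came across a problem statement that asks for all the common substrings of two given strings, such that in each case only the longest substring is printed. The problem statement is as follows:

Write a program to find the common substrings between two given strings. However, do not include substrings that are contained within longer common substrings.

For example, given the input strings eatsleepnightxyz and eatsleepabcxyz, the results should be: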

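eatsleep (due to [eatsleep]nightxyz and [eatsleep]abcxyz)

xyz (due to eatsleepnight[xyz] and eatsleepabc[xyz])

a (due to e[a]tsleepnightxyz and eatsleep[a]bcxyz)

t (due to eatsleepnigh[t]xyz and ea[t]sleepabcxyz)

However, the result set should not include e from eatsleepnightxyz and eatsleepabcxyz, because both e's are already contained in the eatsleep mentioned above. You also should not include ea, eat, ats, etc., as these are also covered by eatsleep.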

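In doing so, you do not have to use String utility methods such as: contains, indexOf, StringTokenizer, split, and replace.

My algorithm is as follows: I am starting with brute force, and will switch to a more optimized solution once my basic understanding improves.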

 For String S1:
     Find all the substrings of S1 of all the lengths
     While doing so: Check if it is also a substring of 
     S2.

Trying to work out the time complexity of my approach.

Let the two given strings be an n1-String and an n2-String.

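The number of substrings of S1 is clearly n1(n1 + 1)/2.
But we have to find the average length of a substring of S1.
Let's say it is m. We'll find m separately.
The time complexity of checking whether an m-String is a substring of an n-String is O(n * m).
Now, we are checking whether each m-String is a substring of S2, which is an n2-String.
As seen above, this is an O(n2 * m) algorithm.
The time required by the overall algorithm is then
Tn = (number of substrings in S1) * (time of the character comparison procedure for a substring of average length)
By performing certain calculations, I came to the conclusion that the time complexity is O(n^3 m^2).
Now, our job is to find m in terms of n1.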

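Trying to find m in terms of n1.

T_n = (n)(1) + (n-1)(2) + (n-2)(3) + ... + (2)(n-1) + (1)(n)
where T_n is the sum of the lengths of all the substrings.

The average is this sum divided by the total number of substrings produced.

This is simply a summation and division problem, whose solution is O(n).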
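
The summation above can be worked out in closed form. A short derivation, writing n for n1 and taking T_n as defined above (the sum of the lengths of all substrings), with k standing for the substring length (k is a symbol introduced here for illustration):

```latex
\begin{align*}
T_n &= \sum_{k=1}^{n} k\,(n-k+1)
     = (n+1)\sum_{k=1}^{n} k \;-\; \sum_{k=1}^{n} k^{2} \\
    &= \frac{n(n+1)^{2}}{2} - \frac{n(n+1)(2n+1)}{6}
     = \frac{n(n+1)(n+2)}{6}, \\[4pt]
m   &= \frac{T_n}{n(n+1)/2} = \frac{n+2}{3} = O(n).
\end{align*}
```

For n = 3, for instance, the six substrings have a total length of 10, which gives an average length of 5/3.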

Therefore...

The running time of my algorithm is O(n^5).

With this in mind, I wrote the following code:

package pack.common.substrings;

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FindCommon2 {
    public static final Set<String> commonSubstrings = new LinkedHashSet<String>();

    public static void main(String[] args) {
        printCommonSubstrings("neerajisgreat", "neerajisnotgreat");
        System.out.println(commonSubstrings);
    }

    public static void printCommonSubstrings(String s1, String s2) {
        for (int i = 0; i < s1.length();) {
            List<String> list = new ArrayList<String>();
            for (int j = i; j < s1.length(); j++) {
                String subStr = s1.substring(i, j + 1);
                if (isSubstring(subStr, s2)) {
                    list.add(subStr);
                }
            }
            if (!list.isEmpty()) {
                String s = list.get(list.size() - 1);
                commonSubstrings.add(s);
                i += s.length();
            }
        }
    }

    public static boolean isSubstring(String s1, String s2) {
        boolean isSubstring = true;
        int strLen = s2.length();
        int strToCheckLen = s1.length();
        if (strToCheckLen > strLen) {
            isSubstring = false;
        } else {
            for (int i = 0; i <= (strLen - strToCheckLen); i++) {
                int index = i;
                int startingIndex = i;
                for (int j = 0; j < strToCheckLen; j++) {
                    if (!(s1.charAt(j) == s2.charAt(index))) {
                        break;
                    } else {
                        index++;
                    }
                }
                if ((index - startingIndex) < strToCheckLen) {
                    isSubstring = false;
                } else {
                    isSubstring = true;
                    break;
                }
            }
        }
        return isSubstring;
    }
}

Explanation of my code:

 printCommonSubstrings: Finds all the substrings of S1 and 
                        checks if it is also a substring of 
                        S2.
 isSubstring : As the name suggests, it checks if the given string 
               is a substring of the other string.

Problem: given the inputs

  S1 = "neerajisgreat";
  S2 = "neerajisnotgreat";
  S3 = "rajeatneerajisnotgreat";

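For S1 and S2, the output should be: neerajis and great. For S1 and S3, the output should be: neerajis, raj, great, eat. Yet I still get only neerajis and great as output. I need to figure this out.

How should I design my code?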

Answer 1

200*_*ess 18

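It would be better to use a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix tree and dynamic programming.

The dynamic programming solution takes O(nm) time and O(nm) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring: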

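import java.util.HashSet;
import java.util.Set;

public static Set<String> longestCommonSubstrings(String s, String t) {
    // table[i][j] is the length of the common substring ending at s[i] and t[j]
    int[][] table = new int[s.length()][t.length()];
    int longest = 0;
    Set<String> result = new HashSet<>();

    for (int i = 0; i < s.length(); i++) {
        for (int j = 0; j < t.length(); j++) {
            if (s.charAt(i) != t.charAt(j)) {
                continue;  // no match: the cell keeps its default value 0
            }

            // Extend the diagonal run that ends at (i - 1, j - 1), if any
            table[i][j] = (i == 0 || j == 0) ? 1
                                             : 1 + table[i - 1][j - 1];
            if (table[i][j] > longest) {
                // A strictly longer match invalidates everything found so far
                longest = table[i][j];
                result.clear();
            }
            if (table[i][j] == longest) {
                result.add(s.substring(i - longest + 1, i + 1));
            }
        }
    }
    return result;
}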

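Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyz (rows) and eatsleepabcxyz (columns):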

  e a t s l e e p a b c x y z
e 1 0 0 0 0 1 1 0 0 0 0 0 0 0
a 0 2 0 0 0 0 0 0 1 0 0 0 0 0
t 0 0 3 0 0 0 0 0 0 0 0 0 0 0
s 0 0 0 4 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 5 0 0 0 0 0 0 0 0 0
e 1 0 0 0 0 6 1 0 0 0 0 0 0 0
e 1 0 0 0 0 1 7 0 0 0 0 0 0 0
p 0 0 0 0 0 0 0 8 0 0 0 0 0 0
n 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i 0 0 0 0 0 0 0 0 0 0 0 0 0 0
g 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t 0 0 1 0 0 0 0 0 0 0 0 0 0 0
x 0 0 0 0 0 0 0 0 0 0 0 1 0 0
y 0 0 0 0 0 0 0 0 0 0 0 0 2 0
z 0 0 0 0 0 0 0 0 0 0 0 0 0 3

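The eatsleep result is obvious: it is the 12345678 diagonal streak at the top left.
The xyz result is the 123 diagonal at the bottom right.
The a result is indicated by the 1 near the top (second row, ninth column).
The t result is indicated by the 1 near the bottom left.

What about the other 1s at the left, at the top, and next to the 6 and 7? Those don't count because they appear within the rectangle formed by the 12345678 diagonal; in other words, they are already covered by eatsleep.

I recommend making one pass that does nothing but build the table. Then make a second pass, iterating backwards from the bottom right, to gather the result set.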
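
The two-pass idea can be sketched as follows. This is one possible reading of the approach, not the answerer's own code: the class name AllCommonSubstrings, the method commonSubstrings, and the exact "covered" rule (a diagonal run is dropped when both its row span and its column span lie inside the rectangle of a run that was already accepted) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class AllCommonSubstrings {

    // Hypothetical helper: collects every diagonal run that is not covered
    // by the rectangle of a previously accepted run.
    public static Set<String> commonSubstrings(String s, String t) {
        int n = s.length();
        int m = t.length();

        // Pass 1: table[i][j] = length of the common run ending at s[i], t[j].
        int[][] table = new int[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                if (s.charAt(i) == t.charAt(j)) {
                    table[i][j] = (i == 0 || j == 0) ? 1 : 1 + table[i - 1][j - 1];
                }
            }
        }

        // Pass 2: walk backwards from the bottom-right corner, so any run
        // that could cover the current one has already been seen.
        Set<String> result = new LinkedHashSet<>();
        List<int[]> accepted = new ArrayList<>(); // {firstRow, lastRow, firstCol, lastCol}
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                int len = table[i][j];
                if (len == 0) {
                    continue;
                }
                // Only the last cell of a diagonal run is a candidate.
                boolean runContinues =
                        i + 1 < n && j + 1 < m && table[i + 1][j + 1] > 0;
                if (runContinues) {
                    continue;
                }
                int firstRow = i - len + 1;
                int firstCol = j - len + 1;
                boolean covered = false;
                for (int[] r : accepted) {
                    if (r[0] <= firstRow && i <= r[1]
                            && r[2] <= firstCol && j <= r[3]) {
                        covered = true;
                        break;
                    }
                }
                if (!covered) {
                    accepted.add(new int[] {firstRow, i, firstCol, j});
                    result.add(s.substring(firstRow, i + 1));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(commonSubstrings("eatsleepnightxyz", "eatsleepabcxyz"));
    }
}
```

For the example inputs, the result set contains eatsleep, xyz, a, and t. Because the rule is purely positional, other inputs may also yield short matches that a stricter reading of the problem statement would exclude.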

Answer 2

ktb*_*biz 5

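Typically, this type of substring matching is done with the help of a separate data structure called a Trie (pronounced "try"). The specific variant that best suits this problem is a suffix tree. Your first step should be to take your inputs and build a suffix tree. Then you will need to use the suffix tree to determine the longest common substring, which is a good exercise.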
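
To give a feel for the idea, here is a minimal sketch that uses a plain, uncompressed suffix trie rather than a true suffix tree. That is a deliberate simplification: it needs O(n^2) nodes in the worst case, whereas a real suffix tree compresses chains of nodes and can be built in linear time. The class name SuffixTrieSketch and its methods are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class SuffixTrieSketch {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    // Inserts every suffix of s, so each substring of s is a path from the root.
    public SuffixTrieSketch(String s) {
        for (int start = 0; start < s.length(); start++) {
            Node node = root;
            for (int k = start; k < s.length(); k++) {
                node = node.children.computeIfAbsent(s.charAt(k), c -> new Node());
            }
        }
    }

    // Length of the longest prefix of t.substring(from) that occurs in s.
    public int longestMatch(String t, int from) {
        Node node = root;
        int len = 0;
        for (int k = from; k < t.length(); k++) {
            node = node.children.get(t.charAt(k));
            if (node == null) {
                break;
            }
            len++;
        }
        return len;
    }

    // Tries every start position in t and keeps the first longest match.
    public static String longestCommonSubstring(String s, String t) {
        SuffixTrieSketch trie = new SuffixTrieSketch(s);
        String best = "";
        for (int from = 0; from < t.length(); from++) {
            int len = trie.longestMatch(t, from);
            if (len > best.length()) {
                best = t.substring(from, from + len);
            }
        }
        return best;
    }
}
```

With the strings from the question, longestCommonSubstring("eatsleepnightxyz", "eatsleepabcxyz") returns eatsleep.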
