ZZ *_*der 16 language-agnostic string algorithm
我们需要通过串联在数据库中组合3列.但是,3列可能包含重叠部分,不应复制部分.例如,
"a" + "b" + "c" => "abc"
"abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
"abcdede" + "dedefgh" + "" => "abcdedefgh"
"abcde" + "d" + "ghlmn" => "abcdedghlmn"
"abcdef" + "" + "defghl" => "abcdefghl"
Run Code Online (Sandbox Code Playgroud)
我们当前的算法非常慢,因为它使用强力来识别2个字符串之间的重叠部分.有没有人知道这样做的有效算法?
假设我们有2个字符串A和B.算法需要找到最长的公共子串S,这样A以S结尾,B以S开头.
我们目前在Java中使用的强力实现作为参考,
public static String concat(String s1, String s2) {
if (s1 == null)
return s2;
if (s2 == null)
return s1;
int len = Math.min(s1.length(), s2.length());
// Find the index for the end of overlapping part
int index = -1;
for (int i = len; i > 0; i--) {
String substring = s2.substring(0, i);
if (s1.endsWith(substring)) {
index = i;
break;
}
}
StringBuilder sb = new StringBuilder(s1);
if (index < 0)
sb.append(s2);
else if (index <= s2.length())
sb.append(s2.substring(index));
return sb.toString();
}
Run Code Online (Sandbox Code Playgroud)
Cra*_*ney 28
大多数其他答案都集中在恒定因子优化上,但也可以渐近地做得更好.看看你的算法:它是O(N ^ 2).这似乎是一个问题,可以解决得比这更快!
考虑一下Knuth Morris Pratt.它跟踪到目前为止我们匹配的最大子字符串数量.这意味着它知道S2在S2结束时已经匹配了多少,这就是我们正在寻找的价值!只需将算法修改为继续,而不是在早期匹配子字符串时返回,并让它返回匹配的数量而不是最后的0.
这给你一个O(n)算法.太好了!
int OverlappedStringLength(string s1, string s2) {
//Trim s1 so it isn't longer than s2
if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);
int[] T = ComputeBackTrackTable(s2); //O(n)
int m = 0;
int i = 0;
while (m + i < s1.Length) {
if (s2[i] == s1[m + i]) {
i += 1;
//<-- removed the return case here, because |s1| <= |s2|
} else {
m += i - T[i];
if (i > 0) i = T[i];
}
}
return i; //<-- changed the return here to return characters matched
}
int[] ComputeBackTrackTable(string s) {
var T = new int[s.Length];
int cnd = 0;
T[0] = -1;
T[1] = 0;
int pos = 2;
while (pos < s.Length) {
if (s[pos - 1] == s[cnd]) {
T[pos] = cnd + 1;
pos += 1;
cnd += 1;
} else if (cnd > 0) {
cnd = T[cnd];
} else {
T[pos] = 0;
pos += 1;
}
}
return T;
}
Run Code Online (Sandbox Code Playgroud)
OverlappedStringLength("abcdef","defghl")返回3