Sau*_*ari 9 c++ string algorithm suffix-tree longest-prefix
Problem Description:
References: Fun With Strings
Based on the problem description, a naive approach to find sum of length of LCP for all possible substrings (for a given string) is as follows :
#include <cstring>
#include <iostream>
using std::cout;
using std::cin;
using std::endl;
using std::string;
int lcp(string str1, string str2)
{
string result;
int n1 = str1.length(), n2 = str2.length();
// Compare str1 and str2
for (int i=0, j=0; i<=n1-1 && j<=n2-1; i++,j++)
{
if (str1[i] != str2[j])
break;
result.push_back(str1[i]);
}
return (result.length());
}
int main()
{
string s;
cin>>s;
int sum = 0;
for(int i = 0; i < s.length(); i++)
for(int j = i; j < s.length(); j++)
for(int k = 0; k < s.length(); k++)
for(int l = k; l < s.length(); l++)
sum += lcp(s.substr(i,j - i + 1),s.substr(k,l - k + 1));
cout<<sum<<endl;
return 0;
}
Run Code Online (Sandbox Code Playgroud)
Based on further reading and research on LCP's, I found this document which specifies a way to efficiently find a LCP using an Advanced Data Structure called Tries. I implemented a Trie and a Compressed Trie (Suffix Tree) as follows:
#include <iostream>
#include <cstring>
using std::cout;
using std::cin;
using std::endl;
using std::string;
const int ALPHA_SIZE = 26;
struct TrieNode
{
struct TrieNode *children[ALPHA_SIZE];
string label;
bool isEndOfWord;
};
typedef struct TrieNode Trie;
Trie *getNode(void)
{
Trie *parent = new Trie;
parent->isEndOfWord = false;
parent->label = "";
for(int i = 0; i <ALPHA_SIZE; i++)
parent->children[i] = NULL;
return parent;
}
void insert(Trie *root, string key)
{
Trie *temp = root;
for(int i = 0; i < key.length(); i++)
{
int index = key[i] - 'a';
if(!temp->children[index])
{
temp->children[index] = getNode();
temp->children[index]->label = key[i];
}
temp = temp->children[index];
temp->isEndOfWord = false;
}
temp->isEndOfWord = true;
}
int countChildren(Trie *node, int *index)
{
int count = 0;
for(int i = 0; i < ALPHA_SIZE; i++)
{
if(node->children[i] != NULL)
{
count++;
*index = i;
}
}
return count;
}
void display(Trie *root)
{
Trie *temp = root;
for(int i = 0; i < ALPHA_SIZE; i++)
{
if(temp->children[i] != NULL)
{
cout<<temp->label<<"->"<<temp->children[i]->label<<endl;
if(!temp->isEndOfWord)
display(temp->children[i]);
}
}
}
void compress(Trie *root)
{
Trie *temp = root;
int index = 0;
for(int i = 0; i < ALPHA_SIZE; i++)
{
if(temp->children[i])
{
Trie *child = temp->children[i];
if(!child->isEndOfWord)
{
if(countChildren(child,&index) >= 2)
{
compress(child);
}
else if(countChildren(child,&index) == 1)
{
while(countChildren(child,&index) < 2 and countChildren(child,&index) > 0)
{
Trie *sub_child = child->children[index];
child->label = child->label + sub_child->label;
child->isEndOfWord = sub_child->isEndOfWord;
memcpy(child->children,sub_child->children,sizeof(sub_child->children));
delete(sub_child);
}
compress(child);
}
}
}
}
}
bool search(Trie *root, string key)
{
Trie *temp = root;
for(int i = 0; i < key.length(); i++)
{
int index = key[i] - 'a';
if(!temp->children[index])
return false;
temp = temp->children[index];
}
return (temp != NULL && temp->isEndOfWord);
}
int main()
{
string input;
cin>>input;
Trie *root = getNode();
for(int i = 0; i < input.length(); i++)
for(int j = i; j < input.length(); j++)
{
cout<<"Substring : "<<input.substr(i,j - i + 1)<<endl;
insert(root, input.substr(i,j - i + 1));
}
cout<<"DISPLAY"<<endl;
display(root);
compress(root);
cout<<"AFTER COMPRESSION"<<endl;
display(root);
return 0;
}
Run Code Online (Sandbox Code Playgroud)
My question, is how do I proceed to find the length of the LCP. I can get the LCP by getting the label field at the branching node, but how do I count the length of LCP's for all possible substrings ?
One way I thought of was to some how use the branching node, its label field which holds the LCP, and the branching node's children to find sum of all LCP's length (Lowest Common Ancestor ?). But I am still confused. How do I proceed further ?
Note: It is also possible that my approach to this problem is wrong, so please suggest other methods too for this problem (considering time and space complexity).
Link to similar unanswered questions:
References for Code and Theory:
Update1:
Based on @Adarsh Anurag's answer, I have come up with following implementation with the help of trie data structure,
#include <iostream>
#include <cstring>
#include <stack>
using std::cout;
using std::cin;
using std::endl;
using std::string;
using std::stack;
const int ALPHA_SIZE = 26;
int sum = 0;
stack <int> lcp;
struct TrieNode
{
struct TrieNode *children[ALPHA_SIZE];
string label;
int count;
};
typedef struct TrieNode Trie;
Trie *getNode(void)
{
Trie *parent = new Trie;
parent->count = 0;
parent->label = "";
for(int i = 0; i <ALPHA_SIZE; i++)
parent->children[i] = NULL;
return parent;
}
void insert(Trie *root, string key)
{
Trie *temp = root;
for(int i = 0; i < key.length(); i++)
{
int index = key[i] - 'a';
if(!temp->children[index])
{
temp->children[index] = getNode();
temp->children[index]->label = key[i];
}
temp = temp->children[index];
}
temp->count++;
}
int countChildren(Trie *node, int *index)
{
int count = 0;
for(int i = 0; i < ALPHA_SIZE; i++)
{
if(node->children[i] != NULL)
{
count++;
*index = i;
}
}
return count;
}
void display(Trie *root)
{
Trie *temp = root;
int index = 0;
for(int i = 0; i < ALPHA_SIZE; i++)
{
if(temp->children[i] != NULL)
{
cout<<temp->label<<"->"<<temp->children[i]->label<<endl;
cout<<"CountOfChildren:"<<countChildren(temp,&index)<<endl;
cout<<"Counter:"<<temp->children[i]->count<<endl;
display(temp->children[i]);
}
}
}
void lcp_sum(Trie *root,int counter,string lcp_label)
{
Trie *temp = root;
int index = 0;
for(int i = 0; i < ALPHA_SIZE; i++)
{
if(temp->children[i])
{
Trie *child = temp->children[i];
if(lcp.empty())
{
lcp_label = child->label;
counter = 0;
lcp.push(child->count*lcp_label.length());
sum += lcp.top();
counter += 1;
}
else
{
lcp_label = lcp_label + child->label;
stack <int> temp = lcp;
while(!temp.empty())
{
sum = sum + 2 * temp.top() * child->count;
temp.pop();
}
lcp.push(child->count*lcp_label.length());
sum += lcp.top();
counter += 1;
}
if(countChildren(child,&index) > 1)
{
lcp_sum(child,0,lcp_label);
}
else if (countChildren(child,&index) == 1)
lcp_sum(child,counter,lcp_label);
else
{
while(counter-- && !lcp.empty())
lcp.pop();
}
}
}
}
int main()
{
string input;
cin>>input;
Trie *root = getNode();
for(int i = 0; i < input.length(); i++)
for(int j = i; j < input.length(); j++)
{
cout<<"Substring : "<<input.substr(i,j - i + 1)<<endl;
insert(root, input.substr(i,j - i + 1));
display(root);
}
cout<<"DISPLAY"<<endl;
display(root);
cout<<"COUNT"<<endl;
lcp_sum(root,0,"");
cout<<sum<<endl;
return 0;
}
Run Code Online (Sandbox Code Playgroud)
From the Trie structure, I have removed the variable isEndOfWord
and instead replaced it with a counter
. This variable keeps track of duplicate substrings which should help in counting LCP's for strings with duplicate characters. However, the above implementation works only for strings with distinct characters. I have tried implementing the method suggested by @Adarsh for duplicate characters but does not satisfy any test case.
Update2:
Based on further updated answer from @Adarsh and "trial and error" with different testcases, I seem to have progressed a bit for duplicate characters, however it still does not work as expected. Here is the implementation with comments,
// LCP : Longest Common Prefix
// DFS : Depth First Search
#include <iostream>
#include <cstring>
#include <stack>
#include <queue>
using std::cout;
using std::cin;
using std::endl;
using std::string;
using std::stack;
using std::queue;
const int ALPHA_SIZE = 26;
int sum = 0; // Global variable for LCP sum
stack <int> lcp; //Keeps track of current LCP
// Trie Data Structure Implementation (See References Section)
struct TrieNode
{
struct TrieNode *children[ALPHA_SIZE]; // Search space can be further reduced by keeping track of required indicies
string label;
int count; // Keeps track of repeat substrings
};
typedef struct TrieNode Trie;
Trie *getNode(void)
{
Trie *parent = new Trie;
parent->count = 0;
parent->label = ""; // Root Label at level 0 is an empty string
for(int i = 0; i <ALPHA_SIZE; i++)
parent->children[i] = NULL;
return parent;
}
void insert(Trie *root, string key)
{
Trie *temp = root;
for(int i = 0; i < key.length(); i++)
{
int index = key[i] - 'a'; // Lowercase alphabets only
if(!temp->children[index])
{
temp->children[index] = getNode();
temp->children[index]->label = key[i]; // Label represents the character being inserted into the node
}
temp = temp->children[index];
}
temp->count++;
}
// Returns the count of child nodes for a given node
int countChildren(Trie *node, int *index)
{
int count = 0;
for(int i = 0; i < ALPHA_SIZE; i++)
{
if(node->children[i] != NULL)
{
count++;
*index = i; //Not required for this problem, used in compressed trie implementation
}
}
return count;
}
// Displays the Trie in DFS manner
void display(Trie *root)
{
Trie *temp = root;
int index = 0;
for(int i = 0; i < ALPHA_SIZE; i++)
{
if(temp->children[i] != NULL)
{
cout<<temp->label<<"->"<<temp->children[i]->label<<endl; // Display in this format : Root->Child
cout<<"CountOfChildren:"<<countChildren(temp,&index)<<endl; // Count of Child nodes for Root
cout<<"Counter:"<<temp->children[i]->count<<endl; // Count of repeat substrings for a given node
display(temp->children[i]);
}
}
}
/* COMPRESSED TRIE IMPLEMENTATION
void compress(Trie *root)
{
Trie *temp = root;
int index = 0;
for(int i = 0; i < ALPHA_SIZE; i++)
{
if(temp->children[i])
{
Trie *child = temp->children[i];
//if(!child->isEndOfWord)
{
if(countChildren(child,&index) >= 2)
{
compress(child);
}
else if(countChildren(child,&index) == 1)
{
while(countChildren(child,&index) < 2 and countChildren(child,&index) > 0)
{
Trie *sub_child = child->children[index];
child->label = child->label + sub_child->label;
//child->isEndOfWord = sub_child->isEndOfWord;
memcpy(child->children,sub_child->children,sizeof(sub_child->children));
delete(sub_child);
}
compress(child);
}
}
}
}
}
*/
// Calculate LCP Sum recursively
void lcp_sum(Trie *root,int *counter,string lcp_label,queue <int> *s_count)
{
Trie *temp = root;
int index = 0;
// Traverse through this root's children array, to find child nodes
for(int i = 0; i < ALPHA_SIZE; i++)
{
// If child nodes found, then ...
if(temp->children[i] != NULL)
{
Trie *child = temp->children[i];
// Check if LCP stack is empty
if(lcp.empty())
{
lcp_label = child->label; // Set LCP label as Child's label
*counter = 0; // To make sure counter is not -1 during recursion
/*
* To include LCP of repeat substrings, multiply the count variable with current LCP Label's length
* Push this to a stack called lcp
*/
lcp.push(child->count*lcp_label.length());
// Add LCP for (a,a)
sum += lcp.top() * child->count; // Formula to calculate sum for repeat substrings : (child->count) ^ 2 * LCP Label's Length
*counter += 1; // Increment counter, this is used further to pop elements from the stack lcp, when a branching node is encountered
}
else
{
lcp_label = lcp_label + child->label; // If not empty, then add Child's label to LCP label
stack <int> temp = lcp; // Temporary Stack
/*
To calculate LCP for different combinations of substrings,
2 -> accounts for (a,b) and (b,a)
temp->top() -> For previous substrings and their combinations with the current substring
child->count() -> For any repeat substrings for current node/substring
*/
while(!temp.empty())
{
sum = sum + 2 * temp.top() * child->count;
temp.pop();
}
// Similar to above explanation for if block
lcp.push(child->count*lcp_label.length());
sum += lcp.top() * child->count;
*counter += 1;
}
// If a branching node is encountered
if(countChildren(child,&index) > 1)
{
int lc = 0; // dummy variable
queue <int> ss_count; // queue to keep track of substrings (counter) from the child node of the branching node
lcp_sum(child,&lc,lcp_label,&ss_count); // Recursively calculate LCP for child node
// This part is experimental, does not work for all testcases
// Used to calculate the LCP count for substrings between child nodes of the branching node
if(countChildren(child,&index) == 2)
{
int counter_queue = ss_count.front();
ss_count.pop();
while(counter_queue--)
{
sum = sum + 2 * ss_count.front() * lcp_label.length();
ss_count.pop();
}
}
else
{
// Unclear, what happens if children is > 3
// Should one take combination of each child node with one another ?
while(!ss_count.empty())
{
sum = sum + 2 * ss_count.front() * lcp_label.length();
ss_count.pop();
}
}
lcp_label = temp->label; // Set LCP label back to Root's Label
// Empty the stack till counter is 0, so as to restore it's state when it first entered the child node from the branching node
while(*counter)
{
lcp.pop();
*counter -=1;
}
continue; // Continue to next child of the branching node
}
else if (countChildren(child,&index) == 1)
{
// If count of children is 1, then recursively calculate LCP for further child node
lcp_sum(child,counter,lcp_label,s_count);
}
else
{
// If count of child nodes is 0, then push the counter to the queue for that node
s_count->push(*counter);
// Empty the stack till counter is 0, so as to restore it's state when it first entered the child node from the branching node
while(*counter)
{
lcp.pop();
*counter -=1;
}
lcp_label = temp->label; // Set LCP label back to Root's Label
}
}
}
}
/* SEARCHING A TRIE
bool search(Trie *root, string key)
{
Trie *temp = root;
for(int
您的直觉正在朝着正确的方向发展。
基本上,每当看到子字符串的LCP问题时,都应该考虑后缀数据结构,例如后缀树,后缀数组和后缀自动机。后缀树可以说是功能最强大,最容易处理的树,它们可以很好地解决此问题。
后缀树是一个Trie,包含字符串的所有满足,每个非分支边缘链都压缩为一个长边。具有所有条件的普通trie的问题在于它具有O(N ^ 2)个节点,因此需要O(N ^ 2)个内存。假设您可以使用简单的动态编程预先计算O(N ^ 2)时间和空间中所有对的LCP,则后缀树没有压缩就不好了。压缩的特里占用O(N)内存,但是如果您使用O(N ^ 2)算法构建它,则仍然无效(就像在代码中一样)。您应该使用Ukkonen算法,以直接在O(n)的时间压缩形式构建后缀树。学习和实现此算法并非易事,也许您会发现Web可视化有帮助的。最后一点,为简单起见,我假设在字符串的末尾添加了定点字符(例如dollar $),以确保所有叶子都是后缀树中的显式节点。
注意:
这是主要思想。考虑所有成对的子字符串,并将它们分为具有相同LCA顶点的组。换句话说,让我们计算A [v]:= LCA顶点恰好是v的子串对的数量。如果为每个顶点v计算此数字,则剩下的要解决的问题是:将每个数字乘以节点的深度并得到总和。同样,数组A [*]仅占用O(N)空间,这意味着我们还没有失去在线性时间内解决整个问题的机会。
回想一下,每个子字符串都是根节点路径。考虑两个节点(代表两个任意子串)和一个顶点v。让我们将在顶点v处具有根的子树称为“ v-子树”。然后:
让我们介绍另一个数量B [v]:= LCA顶点在v -subtree 内的子串对的数量。上面的语句显示了一种计算B [v]的有效方法:它只是v-subtree中节点数的平方,因为其中的每对节点都符合标准。但是,此处应考虑多重性,因此每个节点必须计入与其对应的子字符串一样多的次数。
公式如下:
B[v] = Q[v]^2
Q[v] = sum_s( Q[s] + M[s] * len(vs) ) for s in sons(v)
M[v] = sum_s( M[s] ) for s in sons(v)
Run Code Online (Sandbox Code Playgroud)
M [v]是顶点的多重性(即v子树中存在多少个叶子),而Q [v]是v子树中考虑了多重性的节点数。当然,您可以自己推断出叶子的基本情况。使用这些公式,您可以在O(N)时间内遍历树时计算M [*],Q [*],B [*]。
仅需使用B [*]数组来计算A [*]数组。可以通过简单的排除公式在O(N)中完成:
A[v] = B[v] - sum_s( B[s] ) for s in sons(v)
Run Code Online (Sandbox Code Playgroud)
如果实现所有这些,您将能够在完美的O(N)时间和空间上解决整个问题。或者说更好:O(NC)时空,其中C是字母的大小。
要解决该问题,请按如下所示进行。
由于添加了所有子字符串,因此 trie 中的每个节点的 endOfWord 都为 true。
现在开始和我一起以 DFS 方式遍历树:
总和 = 0,堆栈 = {空}
我们a
先相遇。现在 L(A,B)a
可以与其自身形成 1 对。因此现在sum=sum+length
和sum
变为 1。现在将长度即 1 压入堆栈。堆栈 = {1}
搬到b
现在。子串现在是ab
. ab
likea
可以与自身形成一对。因此现在做sum=sum+length
并且sum
变成3。将堆栈内容复制到 stack2。我们得到 1 作为栈顶。这意味着LCPab
为a
1。但它们可以形成 L(a,ab) 和 L(ab,a)。所以添加 sum = sum + 2 * stack.top() 。Sum 变为 3+2 = 5。现在将 stack2 复制回 stack 并压入长度,即 2。stack 变为 {2,1}。
搬去c
。子串是abc
. 它将与自身形成 1 对,因此加 3。总和变为 5+3 = 8。将 stack 复制到 stack2。在顶部,我们有 2 abc
,并将ab
给出 LCP 2,它们将形成 2 对。所以总和=总和+2*2。总和变为 12。弹出 2。堆栈现在有 1。abc
并且a
具有LCP 1并且可以形成2对。因此,总和变为 12+2 = 14。将 stack2 复制回堆栈并将 length 即 3 压入堆栈。
我们到达了 trie 的末尾。清空堆栈并从b
长度为 1 的位置开始并按上述方式继续。这里总和变成 14+1 = 15
我们到达c
。子串在bc
这里。总和将变为 15 + 2 + 2*1(顶部) = 19。
我们到达了 trie 的末尾。c
从长度 1开始。现在总和 = 19+1 = 20。
时间复杂度:O(N^3)。因为生成子字符串需要 O(N^2) 时间,将它们插入到 trie 中需要 O(N) 时间。节点创建是恒定时间。由于所有子串的长度都不为 N,因此所需时间小于 N^3,但 TC 将为 O(N^3)。
我已经测试了这个方法,它只为具有不同字符的单词提供正确的输出。
对于允许重复字符的单词,它会失败。为了求解允许字符重复的单词,您需要存储有关 L(A,B) 的位置 A 和 B 处单词出现次数的信息。在堆栈中,我们需要推送长度和 B_count 对。然后,您可以使用当前子字符串的长度(在堆栈中)* B_count(在堆栈中)* A_count 求LCP之和。我不知道有什么方法可以在不使用 4 个循环的情况下找到 A、B 计数。
请参阅下面的图片以了解单词abb
就这样。谢谢。