How to use a Trie data structure to find the sum of LCPs for all possible substrings?

Sau*_*ari 9 c++ string algorithm suffix-tree longest-prefix

Problem Description: 弦乐 References: Fun With Strings

Based on the problem description, a naive approach to find sum of length of LCP for all possible substrings (for a given string) is as follows :

#include <cstring>
#include <iostream>

using std::cout;
using std::cin;
using std::endl;
using std::string;

int lcp(string str1, string str2) 
{ 
    string result; 
    int n1 = str1.length(), n2 = str2.length(); 

    // Compare str1 and str2 
    for (int i=0, j=0; i<=n1-1 && j<=n2-1; i++,j++) 
    { 
        if (str1[i] != str2[j]) 
            break; 
        result.push_back(str1[i]); 
    } 

    return (result.length()); 
} 

int main()
{
    string s;
    cin>>s;
    int sum = 0;

    for(int i = 0; i < s.length(); i++)
        for(int j = i; j < s.length(); j++)
            for(int k = 0; k < s.length(); k++)
                for(int l = k; l < s.length(); l++)
                    sum += lcp(s.substr(i,j - i + 1),s.substr(k,l - k + 1));
    cout<<sum<<endl;     
    return 0;
}
Run Code Online (Sandbox Code Playgroud)

Based on further reading and research on LCP's, I found this document which specifies a way to efficiently find a LCP using an Advanced Data Structure called Tries. I implemented a Trie and a Compressed Trie (Suffix Tree) as follows:

#include <iostream>
#include <cstring>

using std::cout;
using std::cin;
using std::endl;
using std::string;
const int ALPHA_SIZE = 26;

struct TrieNode
{
    struct TrieNode *children[ALPHA_SIZE];
    string label;
    bool isEndOfWord;
};
typedef struct TrieNode Trie;

Trie *getNode(void)
{
    Trie *parent = new Trie;
    parent->isEndOfWord = false;
    parent->label = "";
    for(int i = 0; i <ALPHA_SIZE; i++)
        parent->children[i] = NULL;

    return parent;
}

void insert(Trie *root, string key)
{
    Trie *temp = root;

    for(int i = 0; i < key.length(); i++)
    {
        int index = key[i] - 'a';
        if(!temp->children[index])
        {
            temp->children[index] = getNode();
            temp->children[index]->label = key[i];
        }
        temp = temp->children[index];
        temp->isEndOfWord = false;
    }
    temp->isEndOfWord = true;
}

int countChildren(Trie *node, int *index)
{
    int count = 0;

    for(int i = 0; i < ALPHA_SIZE; i++)
    {
        if(node->children[i] != NULL)
        {
            count++;
            *index = i;
        }
    }
    return count;
}

void display(Trie *root)
{
    Trie *temp = root;
    for(int i = 0; i < ALPHA_SIZE; i++)
    {
        if(temp->children[i] != NULL)
        {
            cout<<temp->label<<"->"<<temp->children[i]->label<<endl;
            if(!temp->isEndOfWord)
                display(temp->children[i]);
        }
    }
}

void compress(Trie *root)
{
    Trie *temp = root;
    int index = 0;

    for(int i = 0; i < ALPHA_SIZE; i++)
    {
        if(temp->children[i])
        {
            Trie *child = temp->children[i];

            if(!child->isEndOfWord)
            {
                if(countChildren(child,&index) >= 2)
                {
                    compress(child);
                }
                else if(countChildren(child,&index) == 1)
                {
                    while(countChildren(child,&index) < 2 and countChildren(child,&index) > 0)
                    {
                        Trie *sub_child = child->children[index];

                        child->label = child->label + sub_child->label;
                        child->isEndOfWord = sub_child->isEndOfWord;
                        memcpy(child->children,sub_child->children,sizeof(sub_child->children));

                        delete(sub_child);
                    }
                    compress(child);
                }
            }
        }
    }
}

bool search(Trie *root, string key)
{
    Trie *temp = root;

    for(int i = 0; i < key.length(); i++)
    {
        int index = key[i] - 'a';
        if(!temp->children[index])
            return false;
        temp = temp->children[index];
    }
    return (temp != NULL && temp->isEndOfWord);
}

int main()
{
    string input;
    cin>>input;

    Trie *root = getNode();

    for(int i = 0; i < input.length(); i++)
        for(int j = i; j < input.length(); j++)
        {
            cout<<"Substring : "<<input.substr(i,j - i + 1)<<endl;
            insert(root, input.substr(i,j - i + 1));
        }

    cout<<"DISPLAY"<<endl;
    display(root);

    compress(root);
    cout<<"AFTER COMPRESSION"<<endl;
    display(root);

    return 0;
}
Run Code Online (Sandbox Code Playgroud)

My question, is how do I proceed to find the length of the LCP. I can get the LCP by getting the label field at the branching node, but how do I count the length of LCP's for all possible substrings ?

One way I thought of was to some how use the branching node, its label field which holds the LCP, and the branching node's children to find sum of all LCP's length (Lowest Common Ancestor ?). But I am still confused. How do I proceed further ?

Note: It is also possible that my approach to this problem is wrong, so please suggest other methods too for this problem (considering time and space complexity).

Link to similar unanswered questions:

References for Code and Theory:

Update1:

Based on @Adarsh Anurag's answer, I have come up with following implementation with the help of trie data structure,

#include <iostream>
#include <cstring>
#include <stack>

using std::cout;
using std::cin;
using std::endl;
using std::string;
using std::stack;

const int ALPHA_SIZE = 26;
int sum = 0;
stack <int> lcp;

struct TrieNode
{
    struct TrieNode *children[ALPHA_SIZE];
    string label;
    int count;
};
typedef struct TrieNode Trie;

Trie *getNode(void)
{
    Trie *parent = new Trie;
    parent->count = 0;
    parent->label = "";
    for(int i = 0; i <ALPHA_SIZE; i++)
        parent->children[i] = NULL;

    return parent;
}

void insert(Trie *root, string key)
{
    Trie *temp = root;

    for(int i = 0; i < key.length(); i++)
    {
        int index = key[i] - 'a';
        if(!temp->children[index])
        {
            temp->children[index] = getNode();
            temp->children[index]->label = key[i];
        }
        temp = temp->children[index];
    }
    temp->count++;
}

int countChildren(Trie *node, int *index)
{
    int count = 0;

    for(int i = 0; i < ALPHA_SIZE; i++)
    {
        if(node->children[i] != NULL)
        {
            count++;
            *index = i;
        }
    }
    return count;
}

void display(Trie *root)
{
    Trie *temp = root;
    int index = 0;
    for(int i = 0; i < ALPHA_SIZE; i++)
    {
        if(temp->children[i] != NULL)
        {
            cout<<temp->label<<"->"<<temp->children[i]->label<<endl;
            cout<<"CountOfChildren:"<<countChildren(temp,&index)<<endl;
            cout<<"Counter:"<<temp->children[i]->count<<endl;

            display(temp->children[i]);
        }
    }
}

void lcp_sum(Trie *root,int counter,string lcp_label)
{
    Trie *temp = root;
    int index = 0;

    for(int i = 0; i < ALPHA_SIZE; i++)
    {
        if(temp->children[i])
        {
            Trie *child = temp->children[i];

            if(lcp.empty())
            {
                lcp_label = child->label;
                counter = 0;

                lcp.push(child->count*lcp_label.length());
                sum += lcp.top();
                counter += 1;
            }
            else
            {
                lcp_label = lcp_label + child->label;
                stack <int> temp = lcp;

                while(!temp.empty())
                {
                    sum = sum + 2 * temp.top() * child->count;
                    temp.pop();
                }

                lcp.push(child->count*lcp_label.length());
                sum += lcp.top();
                counter += 1;
            }

            if(countChildren(child,&index) > 1)
            {
                lcp_sum(child,0,lcp_label);
            }
            else if (countChildren(child,&index) == 1)
                lcp_sum(child,counter,lcp_label);
            else
            {
                while(counter-- && !lcp.empty())
                    lcp.pop();
            }
        }
    }
}

int main()
{
    string input;
    cin>>input;

    Trie *root = getNode();

    for(int i = 0; i < input.length(); i++)
        for(int j = i; j < input.length(); j++)
        {
            cout<<"Substring : "<<input.substr(i,j - i + 1)<<endl;
            insert(root, input.substr(i,j - i + 1));
            display(root);
        }

    cout<<"DISPLAY"<<endl;
    display(root);

    cout<<"COUNT"<<endl;
    lcp_sum(root,0,"");
    cout<<sum<<endl;

    return 0;
}
Run Code Online (Sandbox Code Playgroud)

From the Trie structure, I have removed the variable isEndOfWordand instead replaced it with a counter. This variable keeps track of duplicate substrings which should help in counting LCP's for strings with duplicate characters. However, the above implementation works only for strings with distinct characters. I have tried implementing the method suggested by @Adarsh for duplicate characters but does not satisfy any test case.

Update2:

Based on further updated answer from @Adarsh and "trial and error" with different testcases, I seem to have progressed a bit for duplicate characters, however it still does not work as expected. Here is the implementation with comments,

// LCP : Longest Common Prefix
// DFS : Depth First Search

#include <iostream>
#include <cstring>
#include <stack>
#include <queue>

using std::cout;
using std::cin;
using std::endl;
using std::string;
using std::stack;
using std::queue;

const int ALPHA_SIZE = 26;
int sum = 0;     // Global variable for LCP sum
stack <int> lcp; //Keeps track of current LCP

// Trie Data Structure Implementation (See References Section)
struct TrieNode
{
    struct TrieNode *children[ALPHA_SIZE]; // Search space can be further reduced by keeping track of required indicies
    string label;
    int count; // Keeps track of repeat substrings
};
typedef struct TrieNode Trie;

Trie *getNode(void)
{
    Trie *parent = new Trie;
    parent->count = 0;
    parent->label = ""; // Root Label at level 0 is an empty string
    for(int i = 0; i <ALPHA_SIZE; i++)
        parent->children[i] = NULL;

    return parent;
}

void insert(Trie *root, string key)
{
    Trie *temp = root;

    for(int i = 0; i < key.length(); i++)
    {
        int index = key[i] - 'a';   // Lowercase alphabets only
        if(!temp->children[index])
        {
            temp->children[index] = getNode();
            temp->children[index]->label = key[i]; // Label represents the character being inserted into the node
        }
        temp = temp->children[index];
    }
    temp->count++;
}

// Returns the count of child nodes for a given node
int countChildren(Trie *node, int *index)
{
    int count = 0;

    for(int i = 0; i < ALPHA_SIZE; i++)
    {
        if(node->children[i] != NULL)
        {
            count++;
            *index = i; //Not required for this problem, used in compressed trie implementation
        }
    }
    return count;
}

// Displays the Trie in DFS manner
void display(Trie *root)
{
    Trie *temp = root;
    int index = 0;
    for(int i = 0; i < ALPHA_SIZE; i++)
    {
        if(temp->children[i] != NULL)
        {
            cout<<temp->label<<"->"<<temp->children[i]->label<<endl; // Display in this format : Root->Child
            cout<<"CountOfChildren:"<<countChildren(temp,&index)<<endl; // Count of Child nodes for Root
            cout<<"Counter:"<<temp->children[i]->count<<endl; // Count of repeat substrings for a given node
            display(temp->children[i]);
        }
    }
}

/* COMPRESSED TRIE IMPLEMENTATION
void compress(Trie *root)
{
    Trie *temp = root;
    int index = 0;

    for(int i = 0; i < ALPHA_SIZE; i++)
    {
        if(temp->children[i])
        {
            Trie *child = temp->children[i];

            //if(!child->isEndOfWord)
            {
                if(countChildren(child,&index) >= 2)
                {
                    compress(child);
                }
                else if(countChildren(child,&index) == 1)
                {
                    while(countChildren(child,&index) < 2 and countChildren(child,&index) > 0)
                    {
                        Trie *sub_child = child->children[index];

                        child->label = child->label + sub_child->label;
                        //child->isEndOfWord = sub_child->isEndOfWord;
                        memcpy(child->children,sub_child->children,sizeof(sub_child->children));

                        delete(sub_child);
                    }
                    compress(child);
                }
            }
        }
    }
}
*/

// Calculate LCP Sum recursively
void lcp_sum(Trie *root,int *counter,string lcp_label,queue <int> *s_count)
{
    Trie *temp = root;
    int index = 0;

    // Traverse through this root's children array, to find child nodes
    for(int i = 0; i < ALPHA_SIZE; i++)
    {
        // If child nodes found, then ...
        if(temp->children[i] != NULL)
        {
            Trie *child = temp->children[i];

            // Check if LCP stack is empty
            if(lcp.empty())
            {
                lcp_label = child->label;   // Set LCP label as Child's label
                *counter = 0;               // To make sure counter is not -1 during recursion

                /*
                    * To include LCP of repeat substrings, multiply the count variable with current LCP Label's length
                    * Push this to a stack called lcp
                */
                lcp.push(child->count*lcp_label.length());

                // Add LCP for (a,a)
                sum += lcp.top() * child->count; // Formula to calculate sum for repeat substrings : (child->count) ^ 2 * LCP Label's Length
                *counter += 1; // Increment counter, this is used further to pop elements from the stack lcp, when a branching node is encountered
            }
            else
            {
                lcp_label = lcp_label + child->label; // If not empty, then add Child's label to LCP label
                stack <int> temp = lcp; // Temporary Stack

                /*
                    To calculate LCP for different combinations of substrings,
                    2 -> accounts for (a,b) and (b,a)
                    temp->top() -> For previous substrings and their combinations with the current substring
                    child->count() -> For any repeat substrings for current node/substring
                */
                while(!temp.empty())
                {
                    sum = sum + 2 * temp.top() * child->count;
                    temp.pop();
                }

                // Similar to above explanation for if block
                lcp.push(child->count*lcp_label.length());
                sum += lcp.top() * child->count;
                *counter += 1;
            }

            // If a branching node is encountered
            if(countChildren(child,&index) > 1)
            {
                int lc = 0; // dummy variable
                queue <int> ss_count; // queue to keep track of substrings (counter) from the child node of the branching node
                lcp_sum(child,&lc,lcp_label,&ss_count); // Recursively calculate LCP for child node

                // This part is experimental, does not work for all testcases
                // Used to calculate the LCP count for substrings between child nodes of the branching node
                if(countChildren(child,&index) == 2)
                {
                    int counter_queue = ss_count.front();
                    ss_count.pop();

                    while(counter_queue--)
                    {
                        sum = sum +  2 * ss_count.front() * lcp_label.length();
                        ss_count.pop();
                    }
                }
                else
                {
                    // Unclear, what happens if children is > 3
                    // Should one take combination of each child node with one another ?
                    while(!ss_count.empty())
                    {
                        sum = sum +  2 * ss_count.front() * lcp_label.length();
                        ss_count.pop();
                    }
                }

                lcp_label = temp->label; // Set LCP label back to Root's Label

                // Empty the stack till counter is 0, so as to restore it's state when it first entered the child node from the branching node
                while(*counter)
                {
                    lcp.pop();
                    *counter -=1;
                }
                continue; // Continue to next child of the branching node
            }
            else if (countChildren(child,&index) == 1)
            {
                // If count of children is 1, then recursively calculate LCP for further child node
                lcp_sum(child,counter,lcp_label,s_count);
            }
            else
            {
                // If count of child nodes is 0, then push the counter to the queue for that node
                s_count->push(*counter);
                // Empty the stack till counter is 0, so as to restore it's state when it first entered the child node from the branching node
                while(*counter)
                {
                    lcp.pop();
                    *counter -=1;
                }
                lcp_label = temp->label; // Set LCP label back to Root's Label

            }
        }
    }
}

/* SEARCHING A TRIE
bool search(Trie *root, string key)
{
    Trie *temp = root;

    for(int 

stg*_*lov 6

您的直觉正在朝着正确的方向发展。

基本上,每当看到子字符串的LCP问题时,都应该考虑后缀数据结构,例如后缀树后缀数组后缀自动机。后缀树可以说是功能最强大,最容易处理的树,它们可以很好地解决此问题。

后缀树是一个Trie,包含字符串的所有满足,每个非分支边缘链都压缩为一个长边。具有所有条件的普通trie的问题在于它具有O(N ^ 2)个节点,因此需要O(N ^ 2)个内存。假设您可以使用简单的动态编程预先计算O(N ^ 2)时间和空间中所有对的LCP,则后缀树没有压缩就不好了。压缩的特里占用O(N)内存,但是如果您使用O(N ^ 2)算法构建它,则仍然无效(就像在代码中一样)。您应该使用Ukkonen算法,以直接在O(n)的时间压缩形式构建后缀树。学习和实现此算法并非易事,也许您会发现Web可视化有帮助的。最后一点,为简单起见,我假设在字符串的末尾添加了定点字符(例如dollar $),以确保所有叶子都是后缀树中的显式节点。

注意:

  1. 字符串的每个后缀都表示为从根到树上的叶子的路径(回想哨兵)。这是1-1对应。
  2. 字符串的每个子字符串都表示为从根到树中某个节点(包括“长边内”的隐式节点)的路径,反之亦然。而且,所有等值的子字符串都映射到同一路径。为了了解有多少相等的子字符串映射到特定的根节点路径,请计算节点下方的子树中有多少叶子。
  3. 为了找到两个子串的LCP,找到它们对应的根节点路径,并取节点的LCA。那么LCP是LCA顶点的深度。当然,这将是一个物理顶点,并且其边缘向下延伸。

这是主要思想。考虑所有成对的子字符串,并将它们分为具有相同LCA顶点的组。换句话说,让我们计算A [v]:= LCA顶点恰好是v的子串对的数量。如果为每个顶点v计算此数字,则剩下的要解决的问题是:将每个数字乘以节点的深度并得到总和。同样,数组A [*]仅占用O(N)空间,这意味着我们还没有失去在线性时间内解决整个问题的机会。

回想一下,每个子字符串都是根节点路径。考虑两个节点(代表两个任意子串)和一个顶点v。让我们将在顶点v处具有根的子树称为“ v-子树”。然后:

  • 如果两个节点都在v-subtree内,则它们的LCA也在v-subtree内
  • 否则,它们的LCA在v子树之外,因此它可以双向工作。

让我们介绍另一个数量B [v]:= LCA顶点在v -subtree 内的子串对的数量。上面的语句显示了一种计算B [v]的有效方法:它只是v-subtree中节点数平方,因为其中的每对节点都符合标准。但是,此处应考虑多重性,因此每个节点必须计入与其对应的子字符串一样多的次数。

公式如下:

    B[v] = Q[v]^2
    Q[v] = sum_s( Q[s] + M[s] * len(vs) )    for s in sons(v)
    M[v] = sum_s( M[s] )                     for s in sons(v)
Run Code Online (Sandbox Code Playgroud)

M [v]是顶点的多重性(即v子树中存在多少个叶子),而Q [v]是v子树中考虑了多重性的节点数。当然,您可以自己推断出叶子的基本情况。使用这些公式,您可以在O(N)时间内遍历树时计算M [*],Q [*],B [*]。

仅需使用B [*]数组来计算A [*]数组。可以通过简单的排除公式在O(N)中完成:

A[v] = B[v] - sum_s( B[s] )           for s in sons(v)
Run Code Online (Sandbox Code Playgroud)

如果实现所有这些,您将能够在完美的O(N)时间和空间上解决整个问题。或者说更好:O(NC)时空,其中C是字母的大小。


Ada*_*rag 4

要解决该问题,请按如下所示进行。

如果你看一下图片,查找 abc我已经对 abc 的所有子串进行了尝试。

由于添加了所有子字符串,因此 trie 中的每个节点的 endOfWord 都为 true。

现在开始和我一起以 DFS 方式遍历树:

  1. 总和 = 0,堆栈 = {空}

  2. 我们a先相遇。现在 L(A,B)a可以与其自身形成 1 对。因此现在sum=sum+lengthsum变为 1。现在将长度即 1 压入堆栈。堆栈 = {1}

  3. 搬到b现在。子串现在是ab. ablikea可以与自身形成一对。因此现在做sum=sum+length并且sum变成3。将堆栈内容复制到 stack2。我们得到 1 作为栈顶。这意味着LCPaba1。但它们可以形成 L(a,ab) 和 L(ab,a)。所以添加 sum = sum + 2 * stack.top() 。Sum 变为 3+2 = 5。现在将 stack2 复制回 stack 并压入长度,即 2。stack 变为 {2,1}。

  4. 搬去c。子串是abc. 它将与自身形成 1 对,因此加 3。总和变为 5+3 = 8。将 stack 复制到 stack2。在顶部,我们有 2 abc,并将ab给出 LCP 2,它们将形成 2 对。所以总和=总和+2*2。总和变为 12。弹出 2。堆栈现在有 1。abc并且a具有LCP 1并且可以形成2对。因此,总和变为 12+2 = 14。将 stack2 复制回堆栈并将 length 即 3 压入堆栈。

  5. 我们到达了 trie 的末尾。清空堆栈并从b长度为 1 的位置开始并按上述方式继续。这里总和变成 14+1 = 15

  6. 我们到达c。子串在bc这里。总和将变为 15 + 2 + 2*1(顶部) = 19。

  7. 我们到达了 trie 的末尾。c从长度 1开始。现在总和 = 19+1 = 20。

时间复杂度:O(N^3)。因为生成子字符串需要 O(N^2) 时间,将它们插入到 trie 中需要 O(N) 时间。节点创建是恒定时间。由于所有子串的长度都不为 N,因此所需时间小于 N^3,但 TC 将为 O(N^3)。

我已经测试了这个方法,它只为具有不同字符的单词提供正确的输出。

对于允许重复字符的单词,它会失败。为了求解允许字符重复的单词,您需要存储有关 L(A,B) 的位置 A 和 B 处单词出现次数的信息。在堆栈中,我们需要推送长度和 B_count 对。然后,您可以使用当前子字符串的长度(在堆栈中)* B_count(在堆栈中)* A_count 求LCP之和。我不知道有什么方法可以在不使用 4 个循环的情况下找到 A、B 计数。

请参阅下面的图片以了解单词abb

图片1 图2 图3

就这样。谢谢。