Ukkonen的简明英语后缀树算法

Nat*_*ley 1065 language-agnostic string algorithm suffix-tree

此时我觉得有点厚.我花了几天时间试图完全用后缀树构建我的头,但由于我没有数学背景,因为他们开始过度使用数学符号系统时,许多解释都没有.最接近我发现的一个很好的解释是使用后缀树进行快速字符串搜索,但是他隐藏了各种点,并且算法的某些方面仍然不清楚.

在Stack Overflow中对此算法的逐步解释对于我以外的许多其他人来说都是非常宝贵的,我敢肯定.

作为参考,这里是Ukkonen关于该算法的论文:http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

到目前为止我的基本理解:

  • 我需要迭代给定字符串T的每个前缀P.
  • 我需要遍历前缀P中的每个后缀S并将其添加到树中
  • 要将后缀S添加到树中,我需要遍历S中的每个字符,迭代包括沿着现有分支向下走,该分支以S中的相同字符集C开头,并且当我可能将边缘拆分为后代节点时在后缀中找到不同的字符,或者如果没有匹配的边缘则向下走.当没有找到匹配的边缘向下走C时,为C创建一个新的叶边.

基本算法似乎是O(n 2),正如我们在大多数解释中所指出的那样,因为我们需要遍历所有前缀,然后我们需要逐步遍历每个前缀的每个后缀.由于他使用的后缀指针技术,Ukkonen的算法显然是独一无二的,尽管我认为是我无法理解的.

我也很难理解:

  • 确切地指定,使用和更改"活动点"的时间和方式
  • 算法的典型化方面发生了什么
  • 为什么我看到的实现需要"修复"他们正在使用的边界变量

这是完成的C#源代码.它不仅工作正常,而且支持自动规范化,并提供更好看的输出文本图形.源代码和示例输出位于:

https://gist.github.com/2373868


更新2017-11-04

多年以后,我发现了后缀树的新用途,并在JavaScript中实现了该算法.要点如下.它应该没有错误.npm install chalk从同一位置将其转储到js文件中,然后使用node.js运行以查看一些彩色输出.在同一个Gist中有一个精简版本,没有任何调试代码.

https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6

jog*_*pan 2322

The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (i.e. does not contain any repeated characters), and then extending it to the full algorithm.

First, a few preliminary statements.

  1. What we are building, is basically like a search trie. So there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth

  2. But: Unlike in a search trie, the edge labels are not single characters. Instead, each edge is labeled using a pair of integers [from,to]. These are pointers into the text. In this sense, each edge carries a string label of arbitrary length, but takes only O(1) space (two pointers).

Basic principle

I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:

abc
Run Code Online (Sandbox Code Playgroud)

The algorithm works in steps, from left to right. There is one step for every character of the string. Each step might involve more than one individual operation, but we will see (see the final observations at the end) that the total number of operations is O(n).

So, we start from the left, and first insert only the single character a by creating an edge from the root node (on the left) to a leaf, and labeling it as [0,#], which means the edge represents the substring starting at position 0 and ending at the current end. I use the symbol # to mean the current end, which is at position 1 (right after a).

So we have an initial tree, which looks like this:

And what it means is this:

Now we progress to position 2 (right after b). Our goal at each step is to insert all suffixes up to the current position. We do this by

  • expanding the existing a-edge to ab
  • inserting one new edge for b

In our representation this looks like

enter image description here

它意味着:

我们观察到两件事:

  • 对于边缘的表示ab相同的,因为它使用的是在初始树:[0,#].其含义自动更改,因为我们将当前位置#从1 更新为2.
  • 每个边消耗O(1)空间,因为它只包含两个指向文本的指针,无论它代表多少个字符.

接下来,我们再次递增位置并通过将a附加c到每个现有边并为新后缀插入一个新边来更新树c.

在我们的表示中,这看起来像

它意味着:

我们观察到:

  • 树是 每个步骤之后到当前位置的正确后缀树
  • 步骤与文本中的字符一样多
  • The amount of work in each step is O(1), because all existing edges are updated automatically by incrementing #, and inserting the one new edge for the final character can be done in O(1) time. Hence for a string of length n, only O(n) time is required.

First extension: Simple repetitions

Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:

abcabxabcd
Run Code Online (Sandbox Code Playgroud)

It starts with abc as in the previous example, then ab is repeated and followed by x, and then abc is repeated followed by d.

Steps 1 through 3: After the first 3 steps we have the tree from the previous example:

Step 4: We move # to position 4. This implicitly updates all existing edges to this:

我们需要a在根目录下插入当前步骤的最后一个后缀.

在我们这样做之前,我们引入了另外两个变量(除此之外 #),这些变量当然一直存在,但到目前为止我们还没有使用它们:

  • 活性点,这是一个三 (active_node,active_edge,active_length)
  • remainder,这是表明我们需要多少新的后缀插入一个整数

这两者的确切含义很快就会清楚,但现在让我们说:

  • In the simple abc example, the active point was always (root,'\0x',0), i.e. active_node was the root node, active_edge was specified as the null character '\0x', and active_length was zero. The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge. We will see soon why a triple is necessary to represent this information.
  • The remainder was always set to 1 at the beginning of each step. The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).

Now this is going to change. When we insert the current final character a at the root, we notice that there is already an outgoing edge starting with a, specifically: abca. Here is what we do in such a case:

  • We do not insert a fresh edge [4,#] at the root node. Instead we simply notice that the suffix a is already in our tree. It ends in the middle of a longer edge, but we are not bothered by that. We just leave things the way they are.
  • We set the active point to (root,'a',1). That means the active point is now somewhere in the middle of outgoing edge of the root node that starts with a, specifically, after position 1 on that edge. We notice that the edge is specified simply by its first character a. That suffices because there can be only one edge starting with any particular character (confirm that this is true after reading through the entire description).
  • We also increment remainder, so at the beginning of the next step it will be 2.

Observation: When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the active point and remainder). The tree is then not an accurate representation of the suffix tree up to the current position any more, but it contains all suffixes (because the final suffix a is contained implicitly). Hence, apart from updating the variables (which are all of fixed length, so this is O(1)), there was no work done in this step.

Step 5: We update the current position # to 5. This automatically updates the tree to this:

And because remainder is 2, we need to insert two final suffixes of the current position: ab and b. This is basically because:

  • The a suffix from the previous step has never been properly inserted. So it has remained, and since we have progressed one step, it has now grown from a to ab.
  • And we need to insert the new final edge b.

In practice this means that we go to the active point (which points to behind the a on what is now the abcab edge), and insert the current final character b. But: Again, it turns out that b is also already present on that same edge.

So, again, we do not change the tree. We simply:

  • Update the active point to (root,'a',2) (same node and edge as before, but now we point to behind the b)
  • Increment the remainder to 3 because we still have not properly inserted the final edge from the previous step, and we don't insert the current final edge either.

To be clear: We had to insert ab and b in the current step, but because ab was already found, we updated the active point and did not even attempt to insert b. Why? Because if ab is in the tree, every suffix of it (including b) must be in the tree, too. Perhaps only implicitly, but it must be there, because of the way we have built the tree so far.

We proceed to step 6 by incrementing #. The tree is automatically updated to:

Because remainder is 3, we have to insert abx, bx and x. The active point tells us where ab ends, so we only need to jump there and insert the x. Indeed, x is not there yet, so we split the abcabx edge and insert an internal node:

The edge representations are still pointers into the text, so splitting and inserting an internal node can be done in O(1) time.

So we have dealt with abx and decrement remainder to 2. Now we need to insert the next remaining suffix, bx. But before we do that we need to update the active point. The rule for this, after splitting and inserting an edge, will be called Rule 1 below, and it applies whenever the active_node is root (we will learn rule 3 for other cases further below). Here is rule 1:

After an insertion from root,

  • active_node remains root
  • active_edge is set to the first character of the new suffix we need to insert, i.e. b
  • active_length is reduced by 1

Hence, the new active-point triple (root,'b',1) indicates that the next insert has to be made at the bcabx edge, behind 1 character, i.e. behind b. We can identify the insertion point in O(1) time and check whether x is already present or not. If it was present, we would end the current step and leave everything the way it is. But x is not present, so we insert it by splitting the edge:

Again, this took O(1) time and we update remainder to 1 and the active point to (root,'x',0) as rule 1 states.

But there is one more thing we need to do. We'll call this Rule 2:

If we split an edge and insert a new node, and if that is not the first node created during the current step, we connect the previously inserted node and the new node through a special pointer, a suffix link. We will later see why that is useful. Here is what we get, the suffix link is represented as a dotted edge:

We still need to insert the final suffix of the current step, x. Since the active_length component of the active node has fallen to 0, the final insert is made at the root directly. Since there is no outgoing edge at the root node starting with x, we insert a new edge:

As we can see, in the current step all remaining inserts were made.

We proceed to step 7 by setting #=7, which automatically appends the next character, a, to all leaf edges, as always. Then we attempt to insert the new final character to the active point (the root), and find that it is there already. So we end the current step without inserting anything and update the active point to (root,'a',1).

In step 8, #=8, we append b, and as seen before, this only means we update the active point to (root,'a',2) and increment remainder without doing anything else, because b is already present. However, we notice (in O(1) time) that the active point is now at the end of an edge. We reflect this by re-setting it to (node1,'\0x',0). Here, I use node1 to refer to the internal node the ab edge ends at.

Then, in step #=9, we need to insert 'c' and this will help us to understand the final trick:

Second extension: Using suffix links

As always, the # update appends c automatically to the leaf edges and we go to the active point to see if we can insert 'c'. It turns out 'c' exists already at that edge, so we set the active point to (node1,'c',1), increment remainder and do nothing else.

Now in step #=10, remainder, and so we first need to insert abcd (which remains from 3 steps ago) by inserting d at the active point.

Attempting to insert d at the active point causes an edge split in O(1) time:

The active_node, from which the split was initiated, is marked in red above. Here is the final rule, Rule 3:

After splitting an edge from an active_node that is not the root node, we follow the suffix link going out of that node, if there is any, and reset the active_node to the node it points to. If there is no suffix link, we set the active_node to the root. active_edge and active_length remain unchanged.

So the active point is now (node2,'c',1), and node2 is marked in red below:

Since the insertion of abcd is complete, we decrement remainder to 3 and consider the next remaining suffix of the current step, bcd. Rule 3 has set the active point to just the right node and edge so inserting bcd can be done by simply inserting its final character d at the active point.

Doing this causes another edge split, and because of rule 2, we must create a suffix link from the previously inserted node to the new one:

We observe: Suffix links enable us to reset the active point so we can make the next remaining insert at O(1) effort. Look at the graph above to confirm that indeed node at label ab is linked to the node at b (its suffix), and the node at abc is linked to bc.

The current step is not finished yet. remainder is now 2, and we need to follow rule 3 to reset the active point again. Since the current active_node (red above) has no suffix link, we reset to root. The active point is now (root,'c',1).

Hence the next insert occurs at the one outgoing edge of the root node whose label starts with c: cabxabcd, behind the first character, i.e. behind c. This causes another split:

And since this involves the creation of a new internal node,we follow rule 2 and set a new suffix link from the previously created internal node:

(I am using Graphviz Dot for these little graphs. The new suffix link caused dot to re-arrange the existing edges, so check carefully to confirm that the only thing that was inserted above is a new suffix link.)

With this, remainder can be set to 1 and since the active_node is root, we use rule 1 to update the active point to (root,'d',0). This means the final insert of the current step is to insert a single d at root:

That was the final step and we are done. There are number of final observations, though:

  • In each step we move # forward by 1 position. This automatically updates all leaf nodes in O(1) time.

  • But it does not deal with a) any suffixes remaining from previous steps, and b) with the one final character of the current step.

  • remainder tells us how many additional inserts we need to make. These inserts correspond one-to-one to the final suffixes of the string that ends at the current position #. We consider one after the other and make the insert. Important: Each insert is done in O(1) time since the active point tells us exactly where to go, and we need to add only one single character at the active point. Why? Because the other characters are contained implicitly (otherwise the active point would not be where it is).

  • After each such insert, we decrement remainder and follow the suffix link if there is any. If not we go to root (rule 3). If we are at root already, we modify the active point using rule 1. In any case, it takes only O(1) time.

  • If, during one of these inserts, we find that the character we want to insert is already there, we don't do anything and end the current step, even if remainder>0. The reason is that any inserts that remain will be suffixes of the one we just tried to make. Hence they are all implicit in the current tree. The fact that remainder>0 makes sure we deal with the remaining suffixes later.

  • What if at the end of the algorithm remainder>0? This will be the case whenever the end of the text is a substring that occurred somewhere before. In that case we must append one extra character at the end of the string that has not occurred before. In the literature, usually the dollar sign $ is used as a symbol for that. Why does that matter? --> If later we use the completed suffix tree to search for suffixes, we must accept matches only if they end at a leaf. Otherwise we would get a lot of spurious matches, because there are many strings implicitly contained in the tree that are not actual suffixes of the main string. Forcing remainder to be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. Howe

    • 对不起,这比我希望的要长一点.我意识到它解释了我们都知道的一些微不足道的事情,而困难的部分可能仍然不是很清楚.让我们一起编辑它. (73认同)
    • 我应该补充一点,这不是基于Dan Gusfield的书中的描述.这是通过首先考虑没有重复的字符串然后讨论如何处理重复来描述算法的新尝试.我希望这会更直观. (67认同)
    • 谢谢@jogojapan,由于你的解释,我能够写一个完整的例子.我已经发布了源代码,所以希望其他人可能会发现它的用途:https://gist.github.com/2373868 (7认同)
    • 好的,我的代码已被完全重写,现在可以在所有情况下正常工作,包括自动封装,还有一个更好的文本图输出.https://gist.github.com/2373868 (6认同)
    • @NathanRidley是的(顺便说一下,最后一点是Ukkonen称之为canonicize).触发它的一种方法是确保有一个子串出现三次,并以一个字符串结束,该字符串在另一个上下文中再次出现.例如`abcdefabxybcdmnabcdex`.`abcd`的初始部分在`abxy`中重复(这在`ab`之后创建一个内部节点)并在`abcdex`中重复,它以`bcd`结尾,它不仅出现在`bcdex`上下文中,但也在`bcdmn`背景下.插入`abcdex`之后,我们按照后缀链接插入`bcdex`,然后canonicize将启动. (3认同)
    • 这是一个很好的解释,但我从这里更清楚、更快速地理解它 https://youtu.be/F3nbY3hIDLQ?t=37m19s ,这是麻省理工学院开放课件的讲座,他先解释了压缩的树,然后继续解释后缀树. 我会说这是必看的。 (3认同)
    • 这是jogojapan的一个很棒的答案.我也在努力理解和实施它时遇到了困难.在旁注中,我面临的一个问题是验证我的算法实现 - 有两次我运行代码并在一些测试用例中发现了bug. (2认同)
    • 当标记化不可能时,例如在搜索基因序列而不是自然语言时,后缀树(或更好的后缀数组)是一种选择,尤其是。当搜索模式很长时。在识别重复子串、正方形或回文子串等模式时,它们也很好。在生物信息学中有各种专门的应用。但是,我不认为这里描述的 Ukkonen 算法现在在实践中使用得非常多;我猜大多数对它感兴趣的人是出于教育原因而感兴趣的。 (2认同)
    • 关于此的注释:“但是,如果我们要使用树来搜索常规子字符串,而不仅仅是主字符串的后缀,则实际上并不需要最后一步,如以下OP的注释所建议。” 通常,您可以使用确定的后缀树来报告特定子字符串的出现次数,方法是仅跟踪从根到该子字符串定义的特定节点或边缘的路径,然后计算该子树中的叶子数。但是该属性不适用于未完成的后缀树。 (2认同)

mak*_*nov 128

我尝试使用jogojapan的答案中给出的方法来实现后缀树,但由于用于规则的措辞,它在某些情况下不起作用.此外,我已经提到没有人设法使用这种方法实现一个绝对正确的后缀树.下面我将写一个jogojapan答案的"概述",并对规则进行一些修改.我还将描述忘记创建重要后缀链接的情况.

使用其他变量

  1. 活动点 - 三元组(active_node; active_edge; active_length),显示我们必须从哪里开始插入新后缀.
  2. 余数 - 显示我们必须明确添加的后缀数量.例如,如果我们的单词是'abcaabca',而余数= 3,则意味着我们必须处理3个最后的后缀:bca,caa.

让我们使用内部节点的概念- 除了叶子之外的所有节点都是内部节点.

观察1

当我们发现需要插入的最后一个后缀已经存在于树中时,树本身根本没有改变(我们只更新了active pointremainder).

观察2

如果在某一点active_length上大于或等于当前edge(edge_length)的长度,我们active point向下移动直到edge_length严格大于active_length.

现在,让我们重新定义规则:

规则1

如果从活动节点 = root插入后,活动长度大于0,则:

  1. 活动节点未更改
  2. 活动长度递减
  3. 活动边缘向右移动(到下一个后缀的第一个字符,我们必须插入)

规则2

如果我们创建了一个新的内部节点 从做出插入内部节点,这是不是第一次此类 内部节点在当前步骤,那么我们链路中的前SUCH与节点THIS通过一个后缀链接.

这个定义Rule 2与jogojapan'不同,因为我们不仅考虑了新创建的内部节点,还考虑了我们进行插入的内部节点.

规则3

从插入后活动节点是不是节点,我们必须按照后缀链路和设置主动节点所指向的节点.如果不存在一个链路后缀,设置活动节点节点.无论哪种方式,有效边沿有效长度保持不变.

在这个定义中Rule 3我们还考虑了叶节点的插入(不仅是分裂节点).

最后,观察3:

当我们想要添加到树的符号已经在边缘时,我们根据Observation 1,仅更新active pointremainder保持树不变.但是如果有一个标记为需要后缀链接内部节点,我们必须通过后缀链接将该节点与我们的当前节点连接起来.active node

如果我们在这种情况下添加后缀链接,如果我们不这样做,那么让我们看看cdddcdc的后缀树示例:

  1. 如果我们通过后缀链接连接节点:

    • 在添加最后一个字母c之前:

    • 添加最后一个字母c后:

  2. 如果我们DO通过后缀链接连接的节点:

    • 在添加最后一个字母c之前:

    • 添加最后一个字母c后:

似乎没有显着差异:在第二种情况下,还有两个后缀链接.但是这些后缀链接是正确的,其中一个 - 从蓝色节点到红色节点 - 对于我们的主动点方法非常重要.问题是,如果我们不在这里添加后缀链接,稍后,当我们向树中添加一些新字母时,我们可能会省略一些节点添加到树中,因为,根据它,如果没有后缀链接,那么我们必须把根目录.Rule 3active_node

当我们将最后一个字母添加到树中时,在我们从蓝色节点(边缘标记为'c')插入之前,红色节点已经存在.由于蓝色节点有插入,我们将其标记为需要后缀链接.然后,依靠活动点方法,将其设置为红色节点.但是我们不从红色节点插入,因为字母'c'已经在边缘.这是否意味着蓝色节点必须没有后缀链接?不,我们必须通过后缀链接将蓝色节点与红色节点连接起来.为什么这是正确的?因为主动点方法保证我们到达正确的位置,即下一个我们必须处理较短后缀插入的地方.active node

最后,这是我对后缀树的实现:

  1. Java的
  2. C++

希望这个"概述"结合jogojapan的详细答案将有助于某人实现自己的后缀树.

  • 非常感谢,+1为你的努力.我相信你是对的..虽然我没有时间立刻考虑细节.我稍后会检查,也可能会修改我的答案. (3认同)
  • 对于规则3,一种聪明的方法是将root的后缀链接设置为root本身,(默认情况下)将每个节点的后缀链接设置为root。因此,我们可以避免使用条件,而只需跟随后缀链接即可。 (3认同)

mut*_*tux 9

感谢@jogojapan精心解释的教程,我在Python中实现了算法.

@jogojapan提到的一些小问题比我预期的要复杂得多,需要非常小心对待.我花了几天的时间才能让我的实现足够强大(我想).问题和解决方案如下:

  1. 最后Remainder > 0结果事实证明,在展开步骤中也可能发生这种情况,而不仅仅是整个算法的结束.当发生这种情况时,我们可以保持余数,actnode,actedge和actlength 不变,结束当前的展开步骤,并通过保持折叠或展开来开始另一步,具体取决于原始字符串中的下一个字符是否在当前路径上或不.

  2. 跳过节点:当我们遵循后缀链接时,更新活动点,然后发现其active_length组件与新的active_node不兼容.我们必须向前移动到合适的位置裂开,或插入叶.这个过程可能是不那么简单,因为移动actlength期间actedge不断变化一路,当你不得不搬回到根节点actedgeactlength可能是错误的,因为这些举措的.我们需要额外的变量来保存这些信息.

    在此输入图像描述

@managonov以某种方式指出了另外两个问题

  1. 拆分可能会退化当尝试拆分边时,有时您会发现拆分操作在节点上是正确的.在这种情况下,我们只需要向该节点添加一个新的叶子,将其作为标准的边缘分割操作,这意味着后缀链接(如果有的话)应该相应地进行维护.

  2. 隐藏的后缀链接问题1问题2引起了另一个特殊情况.有时我们需要跳过几个节点到正确的点进行拆分,如果我们通过比较余数字符串和路径标签来移动,我们可能会超过正确的点.那种情况下,后缀链接将被无意中忽略,如果有的话.通过在前进时记住正确的点可以避免这种情况.如果拆分节点已存在,则应保留后缀链接,或者甚至在展开步骤期间发生问题1.

最后,我在Python中的实现如下:

提示: 它包含上面代码中的天真树打印功能,这在调试时非常重要.它为我节省了大量时间,便于查找特殊情况.


Kam*_*mil 7

@jogojapan,您带来了很棒的解释和可视化。但是正如@makagonov所说,它缺少一些有关设置后缀链接的规则。通过“ aabaaabb”一词逐步浏览http://brenden.github.io/ukkonen-animation/时,可以很好地看到它。当您从步骤10转到步骤11时,没有从节点5到节点2的后缀链接,但是活动点突然在那里移动。

@makagonov,因为我生活在Java世界中,所以我也尝试遵循您的实现来掌握ST构建工作流程,但由于以下原因,我感到很难:

  • 结合边缘和节点
  • 使用索引指针而不是引用
  • 破坏陈述;
  • 继续陈述;

因此,我最终用Java实现了这种实现,希望以更清晰的方式反映所有步骤,并减少其他Java人员的学习时间:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ST {

  public class Node {
    private final int id;
    private final Map<Character, Edge> edges;
    private Node slink;

    public Node(final int id) {
        this.id = id;
        this.edges = new HashMap<>();
    }

    public void setSlink(final Node slink) {
        this.slink = slink;
    }

    public Map<Character, Edge> getEdges() {
        return this.edges;
    }

    public Node getSlink() {
        return this.slink;
    }

    public String toString(final String word) {
        return new StringBuilder()
                .append("{")
                .append("\"id\"")
                .append(":")
                .append(this.id)
                .append(",")
                .append("\"slink\"")
                .append(":")
                .append(this.slink != null ? this.slink.id : null)
                .append(",")
                .append("\"edges\"")
                .append(":")
                .append(edgesToString(word))
                .append("}")
                .toString();
    }

    private StringBuilder edgesToString(final String word) {
        final StringBuilder edgesStringBuilder = new StringBuilder();
        edgesStringBuilder.append("{");
        for(final Map.Entry<Character, Edge> entry : this.edges.entrySet()) {
            edgesStringBuilder.append("\"")
                    .append(entry.getKey())
                    .append("\"")
                    .append(":")
                    .append(entry.getValue().toString(word))
                    .append(",");
        }
        if(!this.edges.isEmpty()) {
            edgesStringBuilder.deleteCharAt(edgesStringBuilder.length() - 1);
        }
        edgesStringBuilder.append("}");
        return edgesStringBuilder;
    }

    public boolean contains(final String word, final String suffix) {
        return !suffix.isEmpty()
                && this.edges.containsKey(suffix.charAt(0))
                && this.edges.get(suffix.charAt(0)).contains(word, suffix);
    }
  }

  public class Edge {
    private final int from;
    private final int to;
    private final Node next;

    public Edge(final int from, final int to, final Node next) {
        this.from = from;
        this.to = to;
        this.next = next;
    }

    public int getFrom() {
        return this.from;
    }

    public int getTo() {
        return this.to;
    }

    public Node getNext() {
        return this.next;
    }

    public int getLength() {
        return this.to - this.from;
    }

    public String toString(final String word) {
        return new StringBuilder()
                .append("{")
                .append("\"content\"")
                .append(":")
                .append("\"")
                .append(word.substring(this.from, this.to))
                .append("\"")
                .append(",")
                .append("\"next\"")
                .append(":")
                .append(this.next != null ? this.next.toString(word) : null)
                .append("}")
                .toString();
    }

    public boolean contains(final String word, final String suffix) {
        if(this.next == null) {
            return word.substring(this.from, this.to).equals(suffix);
        }
        return suffix.startsWith(word.substring(this.from,
                this.to)) && this.next.contains(word, suffix.substring(this.to - this.from));
    }
  }

  public class ActivePoint {
    private final Node activeNode;
    private final Character activeEdgeFirstCharacter;
    private final int activeLength;

    public ActivePoint(final Node activeNode,
                       final Character activeEdgeFirstCharacter,
                       final int activeLength) {
        this.activeNode = activeNode;
        this.activeEdgeFirstCharacter = activeEdgeFirstCharacter;
        this.activeLength = activeLength;
    }

    private Edge getActiveEdge() {
        return this.activeNode.getEdges().get(this.activeEdgeFirstCharacter);
    }

    public boolean pointsToActiveNode() {
        return this.activeLength == 0;
    }

    public boolean activeNodeIs(final Node node) {
        return this.activeNode == node;
    }

    public boolean activeNodeHasEdgeStartingWith(final char character) {
        return this.activeNode.getEdges().containsKey(character);
    }

    public boolean activeNodeHasSlink() {
        return this.activeNode.getSlink() != null;
    }

    public boolean pointsToOnActiveEdge(final String word, final char character) {
        return word.charAt(this.getActiveEdge().getFrom() + this.activeLength) == character;
    }

    public boolean pointsToTheEndOfActiveEdge() {
        return this.getActiveEdge().getLength() == this.activeLength;
    }

    public boolean pointsAfterTheEndOfActiveEdge() {
        return this.getActiveEdge().getLength() < this.activeLength;
    }

    public ActivePoint moveToEdgeStartingWithAndByOne(final char character) {
        return new ActivePoint(this.activeNode, character, 1);
    }

    public ActivePoint moveToNextNodeOfActiveEdge() {
        return new ActivePoint(this.getActiveEdge().getNext(), null, 0);
    }

    public ActivePoint moveToSlink() {
        return new ActivePoint(this.activeNode.getSlink(),
                this.activeEdgeFirstCharacter,
                this.activeLength);
    }

    public ActivePoint moveTo(final Node node) {
        return new ActivePoint(node, this.activeEdgeFirstCharacter, this.activeLength);
    }

    public ActivePoint moveByOneCharacter() {
        return new ActivePoint(this.activeNode,
                this.activeEdgeFirstCharacter,
                this.activeLength + 1);
    }

    public ActivePoint moveToEdgeStartingWithAndByActiveLengthMinusOne(final Node node,
                                                                       final char character) {
        return new ActivePoint(node, character, this.activeLength - 1);
    }

    public ActivePoint moveToNextNodeOfActiveEdge(final String word, final int index) {
        return new ActivePoint(this.getActiveEdge().getNext(),
                word.charAt(index - this.activeLength + this.getActiveEdge().getLength()),
                this.activeLength - this.getActiveEdge().getLength());
    }

    public void addEdgeToActiveNode(final char character, final Edge edge) {
        this.activeNode.getEdges().put(character, edge);
    }

    public void splitActiveEdge(final String word,
                                final Node nodeToAdd,
                                final int index,
                                final char character) {
        final Edge activeEdgeToSplit = this.getActiveEdge();
        final Edge splittedEdge = new Edge(activeEdgeToSplit.getFrom(),
                activeEdgeToSplit.getFrom() + this.activeLength,
                nodeToAdd);
        nodeToAdd.getEdges().put(word.charAt(activeEdgeToSplit.getFrom() + this.activeLength),
                new Edge(activeEdgeToSplit.getFrom() + this.activeLength,
                        activeEdgeToSplit.getTo(),
                        activeEdgeToSplit.getNext()));
        nodeToAdd.getEdges().put(character, new Edge(index, word.length(), null));
        this.activeNode.getEdges().put(this.activeEdgeFirstCharacter, splittedEdge);
    }

    public Node setSlinkTo(final Node previouslyAddedNodeOrAddedEdgeNode,
                           final Node node) {
        if(previouslyAddedNodeOrAddedEdgeNode != null) {
            previouslyAddedNodeOrAddedEdgeNode.setSlink(node);
        }
        return node;
    }

    public Node setSlinkToActiveNode(final Node previouslyAddedNodeOrAddedEdgeNode) {
        return setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, this.activeNode);
    }
  }

  private static int idGenerator;

  private final String word;
  private final Node root;
  private ActivePoint activePoint;
  private int remainder;

  public ST(final String word) {
    this.word = word;
    this.root = new Node(idGenerator++);
    this.activePoint = new ActivePoint(this.root, null, 0);
    this.remainder = 0;
    build();
  }

  private void build() {
    for(int i = 0; i < this.word.length(); i++) {
        add(i, this.word.charAt(i));
    }
  }

  private void add(final int index, final char character) {
    this.remainder++;
    boolean characterFoundInTheTree = false;
    Node previouslyAddedNodeOrAddedEdgeNode = null;
    while(!characterFoundInTheTree && this.remainder > 0) {
        if(this.activePoint.pointsToActiveNode()) {
            if(this.activePoint.activeNodeHasEdgeStartingWith(character)) {
                activeNodeHasEdgeStartingWithCharacter(character, previouslyAddedNodeOrAddedEdgeNode);
                characterFoundInTheTree = true;
            }
            else {
                if(this.activePoint.activeNodeIs(this.root)) {
                    rootNodeHasNotEdgeStartingWithCharacter(index, character);
                }
                else {
                    previouslyAddedNodeOrAddedEdgeNode = internalNodeHasNotEdgeStartingWithCharacter(index,
                            character, previouslyAddedNodeOrAddedEdgeNode);
                }
            }
        }
        else {
            if(this.activePoint.pointsToOnActiveEdge(this.word, character)) {
                activeEdgeHasCharacter();
                characterFoundInTheTree = true;
            }
            else {
                if(this.activePoint.activeNodeIs(this.root)) {
                    previouslyAddedNodeOrAddedEdgeNode = edgeFromRootNodeHasNotCharacter(index,
                            character,
                            previouslyAddedNodeOrAddedEdgeNode);
                }
                else {
                    previouslyAddedNodeOrAddedEdgeNode = edgeFromInternalNodeHasNotCharacter(index,
                            character,
                            previouslyAddedNodeOrAddedEdgeNode);
                }
            }
        }
    }
  }

  private void activeNodeHasEdgeStartingWithCharacter(final char character,
                                                    final Node previouslyAddedNodeOrAddedEdgeNode) {
    this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
    this.activePoint = this.activePoint.moveToEdgeStartingWithAndByOne(character);
    if(this.activePoint.pointsToTheEndOfActiveEdge()) {
        this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
    }
  }

  private void rootNodeHasNotEdgeStartingWithCharacter(final int index, final char character) {
    this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
    this.activePoint = this.activePoint.moveTo(this.root);
    this.remainder--;
    assert this.remainder == 0;
  }

  private Node internalNodeHasNotEdgeStartingWithCharacter(final int index,
                                                         final char character,
                                                         Node previouslyAddedNodeOrAddedEdgeNode) {
    this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
    if(this.activePoint.activeNodeHasSlink()) {
        this.activePoint = this.activePoint.moveToSlink();
    }
    else {
        this.activePoint = this.activePoint.moveTo(this.root);
    }
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private void activeEdgeHasCharacter() {
    this.activePoint = this.activePoint.moveByOneCharacter();
    if(this.activePoint.pointsToTheEndOfActiveEdge()) {
        this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
    }
  }

  private Node edgeFromRootNodeHasNotCharacter(final int index,
                                             final char character,
                                             Node previouslyAddedNodeOrAddedEdgeNode) {
    final Node newNode = new Node(idGenerator++);
    this.activePoint.splitActiveEdge(this.word, newNode, index, character);
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
    this.activePoint = this.activePoint.moveToEdgeStartingWithAndByActiveLengthMinusOne(this.root,
            this.word.charAt(index - this.remainder + 2));
    this.activePoint = walkDown(index);
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private Node edgeFromInternalNodeHasNotCharacter(final int index,
                                                 final char character,
                                                 Node previouslyAddedNodeOrAddedEdgeNode) {
    final Node newNode = new Node(idGenerator++);
    this.activePoint.splitActiveEdge(this.word, newNode, index, character);
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
    if(this.activePoint.activeNodeHasSlink()) {
        this.activePoint = this.activePoint.moveToSlink();
    }
    else {
        this.activePoint = this.activePoint.moveTo(this.root);
    }
    this.activePoint = walkDown(index);
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private ActivePoint walkDown(final int index) {
    while(!this.activePoint.pointsToActiveNode()
            && (this.activePoint.pointsToTheEndOfActiveEdge() || this.activePoint.pointsAfterTheEndOfActiveEdge())) {
        if(this.activePoint.pointsAfterTheEndOfActiveEdge()) {
            this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge(this.word, index);
        }
        else {
            this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
        }
    }
    return this.activePoint;
  }

  public String toString(final String word) {
    return this.root.toString(word);
  }

  public boolean contains(final String suffix) {
    return this.root.contains(this.word, suffix);
  }

  public static void main(final String[] args) {
    final String[] words = {
            "abcabcabc$",
            "abc$",
            "abcabxabcd$",
            "abcabxabda$",
            "abcabxad$",
            "aabaaabb$",
            "aababcabcd$",
            "ababcabcd$",
            "abccba$",
            "mississipi$",
            "abacabadabacabae$",
            "abcabcd$",
            "00132220$"
    };
    Arrays.stream(words).forEach(word -> {
        System.out.println("Building suffix tree for word: " + word);
        final ST suffixTree = new ST(word);
        System.out.println("Suffix tree: " + suffixTree.toString(word));
        for(int i = 0; i < word.length() - 1; i++) {
            assert suffixTree.contains(word.substring(i)) : word.substring(i);
        }
    });
  }
}
Run Code Online (Sandbox Code Playgroud)


MrV*_*ype 7

道歉,如果我的回答似乎多余,但我最近实施了Ukkonen的算法,发现自己已经为之奋斗了好几天。我必须通读有关该主题的多篇论文,以了解该算法某些核心方面的原因和方式。

我发现先前答案的“规则”方法无助于理解其根本原因,因此,我在下文中仅着重于语用学方面的内容。如果您像我一样努力遵循其他说明,也许我的补充说明会为您“点击”。

我在这里发布了C#实现:https//github.com/baratgabor/SuffixTree

请注意,我不是该主题的专家,因此以下各节可能包含错误(或更糟)。如果遇到任何问题,请随时进行编辑。

先决条件

以下说明的起点假定您熟悉后缀树的内容和用法,以及Ukkonen算法的特征,例如,如何从头到尾逐个字符地扩展后缀树。基本上,我假设您已经阅读了其他一些说明。

(但是,我确实必须为流程添加一些基本的叙述,因此开始时确实可能感觉很多余。)

最有趣的部分是对使用后缀链接和从根目录重新扫描之间的区别解释。这就是我在实施过程中遇到的许多错误和头痛的原因。

开放式叶节点及其局限性

我确定您已经知道最基本的“技巧”是意识到我们可以只保留后缀“ open”的结尾,即引用字符串的当前长度,而不是将结尾设置为静态值。这样,当我们添加其他字符时,这些字符将隐式添加到所有后缀标签中,而无需访问和更新所有后缀。

但是,由于明显的原因,后缀的这种开放结尾仅适用于表示字符串结尾的节点,即树结构中的叶节点。我们在树上执行的分支操作(添加新的分支节点和叶节点)不会自动传播到所需的任何地方。

重复的子字符串不会显式地出现在树中,这很可能是基本的,因此无需赘述,因为由于树是重复的,因此树已经包含了这些子字符串。但是,当重复子字符串由于遇到非重复字符而结束时,我们需要在该点创建一个分支以表示从该点开始的分支。

例如,对于字符串“ ABCXABCY”(请参见下文),需要将XY的分支添加到三个不同的后缀ABCBCC中;否则,它将不是有效的后缀树,并且通过从根向下匹配字符,我们无法找到字符串的所有子字符串。

再次强调一下– 我们对树中后缀执行的任何操作也必须由其连续后缀(例如,ABC> BC> C)反映出来,否则它们不再是有效的后缀。

在后缀中重复分支

但是,即使我们接受必须进行这些手动更新,我们如何知道需要更新多少个后缀?因为,当我们添加重复字符A(以及连续的其余重复字符)时,我们还不知道何时/何地需要将后缀分成两个分支。仅当我们遇到第一个非重复字符时才确定需要拆分,在这种情况下为Y(而不是树中已经存在的X)。

我们可以做的是匹配最长的重复字符串,并计算以后需要更新的后缀个数。这就是“剩余”的意思。

“剩余”和“重新扫描”的概念

变量remainder告诉我们隐式添加了多少个重复字符,没有分支;也就是说,一旦发现无法匹配的第一个字符,我们需要访问多少个后缀以重复分支操作。这实质上等于从树的根开始我们在树中有多少个“深”字符。

因此,与字符串ABCXABCY的前面的示例相同,我们“隐式” 匹配重复的ABC部分,remainder每次递增,结果为3的余数。然后遇到非重复字符“ Y”。在这里,我们拆了以前添加ABCXABC - > XABC - > ÿ。然后我们remainder从3 递减到2,因为我们已经处理了ABC分支。现在我们通过从根开始匹配最后两个字符BC来重复操作,直到需要分割的点,然后我们也将BCX分割为BC- > XBC - > ÿ。同样,我们递减remainder为1,然后重复该操作;直到the remainder为0。最后,我们还需要将当前字符(Y)本身也添加到根中。

该操作从根开始连续跟随后缀,直到我们需要进行操作的点,这在Ukkonen的算法中称为“重新扫描”,通常这是算法中最昂贵的部分。想象一下一个更长的字符串,其中您需要跨数十个节点(我们将在后面讨论),可能需要数千次“重新扫描”长子字符串。

作为解决方案,我们介绍了所谓的“后缀链接”

“后缀链接”的概念

后缀链接基本上指向我们通常必须“重新扫描”的位置,因此,代替昂贵的重新扫描操作,我们可以简单地跳至链接位置,进行工作,跳至下一个链接位置,然后重复–直到出现没有更多职位可更新。

当然,一个大问题是如何添加这些链接。现有的答案是,我们可以在插入新的分支节点时添加链接,这利用了以下事实:在树的每个扩展中,自然需要按照确切的顺序一个接一个地创建分支节点,我们需要将它们链接在一起。虽然,我们必须从最后创建的分支节点(最长的后缀)链接到先前创建的分支节点,所以我们需要缓存我们创建的最后一个分支节点,将其链接到我们创建的下一个分支节点,并缓存新创建的分支节点。

结果是实际上我们通常没有后缀链接,因为给定的分支节点是刚刚创建的。在这些情况下,我们必须从根本上退回到前述的“重新扫描”。这就是为什么在插入后会提示您使用后缀链接或跳转到根目录的原因。

(或者,如果您将父指针存储在节点中,则可以尝试跟随父节点,检查它们是否具有链接,并使用该链接。我发现很少提及该链接,但是后缀链接用法并未提及在石集,有多种可能的方法,如果你了解底层机制可以实现一个适合您的需求是最好的。)

“活跃点”的概念

到目前为止,我们讨论了用于构建树的多种有效工具,并且模糊地涉及遍历多个边缘和节点,但是尚未探讨相应的结果和复杂性。

前面解释的“剩余”概念对于跟踪我们在树中的位置很有用,但我们必须意识到它没有存储足够的信息。

首先,我们总是驻留在节点的特定边缘上,因此我们需要存储边缘信息。我们称其为“主动边缘”

其次,即使添加了边缘信息后,我们仍然没有办法确定一个位置,在树越往下,而不是直接连接到节点。因此,我们还需要存储该节点。我们将此称为“活动节点”

最后,我们可以注意到,“余数”不足以标识未直接连接到根的边上的位置,因为“余数”是整条路线的长度。而且我们可能不想打扰记住和减去前一条边的长度。因此,我们需要一种表示形式,基本上是当前边缘上的其余部分。这就是我们所说的“活动长度”

这导致了我们所谓的“活动点” –三个变量的包,其中包含我们需要维护的有关树中位置的所有信息:

Active Point = (Active Node, Active Edge, Active Length)

您可以在下图上观察到,ABCABD的匹配路由是如何由边缘AB上的2个字符组成的(来自root),以及边缘CABDABCABD上的4个字符组成的(来自节点4)–导致“剩余”为6个字符。因此,我们当前的位置可以标识为活动节点4,活动边缘C,活动长度4

剩余点和活动点

“活动点”的另一个重要作用是为我们的算法提供了一个抽象层,这意味着我们算法的各个部分都可以在“活动点”上进行工作,而不管该活动点是在根还是在其他任何地方。这使得在我们的算法中以简洁明了的方式轻松实现后缀链接的使用。

重新扫描与使用后缀链接的区别

现在,棘手的部分(根据我的经验)可能会导致大量的错误和头痛,并且在大多数来源中都没有很好地解释,这是处理后缀链接案例与重新扫描案例的区别。

考虑以下字符串'AAAABAAAABAAC'的示例:

剩余在多个边缘

您可以在上面观察到“余数” 7 如何对应于来自根的字符总和,而“活动长度” 4如何对应于来自活动节点的活动边缘的匹配字符之和。

现在,在活动点执行分支操作之后,活动节点可能包含后缀链接,也可能不包含后缀链接。

如果存在后缀链接:我们只需要处理“活动长度”部分。该“其余部分,是无关紧要的,因为在这里我们通过后缀链接跳转到该节点已编码正确的“余”含蓄,只是凭借在树枝上,它是。

如果不存在后缀链接:我们需要从零/根开始“重新扫描”,这意味着从头开始处理整个后缀。为此,我们必须使用整个“剩余”作为重新扫描的基础。

有和没有后缀链接的处理示例比较

考虑上面的示例的下一步会发生什么。让我们比较一下如何获得相同的结果(即移至下一个要处理的后缀)(带或不带后缀链接)。

使用“后缀链接”

通过后缀链接到达连续的后缀

请注意,如果我们使用后缀链接,我们将自动位于“正确的位置”。由于“有效长度”可能与新职位“不兼容”,因此通常并非严格如此。

在上述情况下,由于“有效长度”为4,所以我们从链接的节点4开始使用后缀“ ABAA”。但是在找到与后缀的第一个字符('A'),我们注意到我们的“有效长度”在此边沿溢出了3个字符。因此,我们跳过整个边缘,移至下一个节点,并根据跳转所消耗的字符来减少“活动长度”

然后,在找到与后缀'BAA ' 相对应的下一个边缘'B'之后,我们最终注意到边缘长度大于剩余的“有效长度” 3,这意味着我们找到了正确的位置。

请注意,似乎此操作通常不被称为“重新扫描”,即使在我看来,这与重新扫描直接等效,只是缩短了长度且没有根目录起始点。

使用“重新扫描”

通过重新扫描达到连续的后缀

请注意,如果我们使用传统的“重新扫描”操作(在这里假装没有后缀链接),那么我们将从树的顶部开始,从根开始,然后我们必须再次向下移动到正确的位置,沿当前后缀的整个长度。

此后缀的长度是我们前面讨论的“余数”。我们必须消耗掉剩余的全部,直到达到零。这可能(并且经常如此)包括跳过多个节点,每次跳过都会使剩余部分减少我们跳过的边的长度。最后,我们到达的边缘比剩余的“余数”更长;在这里,我们将有效边设置为给定的边,将“有效长度”设置为剩余的“剩余 ”,就可以了。

但是请注意,实际的“ remainder”变量需要保留,并且仅在每次插入节点后才递减。因此,我上面描述的假设使用的是初始化为'remainder'的单独变量。

关于后缀链接和重新扫描的注意事项

1)请注意,两种方法均会导致相同的结果。但是,在大多数情况下,后缀链接跳转明显更快。这就是后缀链接的全部原理。

2)实际的算法实现无需区别。如上所述,即使在使用后缀链接的情况下,“有效长度”也常常与链接位置不兼容,因为树的该分支可能包含其他分支。因此,从本质上讲,您只需要使用“有效长度”而不是“剩余长度”,并执行相同的重新扫描逻辑,直到找到比剩余后缀长度短的边即可。

3)关于性能的一个重要说明是,在重新扫描期间无需检查每个字符。由于有效的后缀树的构建方式,我们可以安全地假设字符匹配。因此,您主要是在计算长度,并且当我们跳到新的边缘时,唯一需要进行字符等效性检查,因为边缘由其第一个字符标识(在给定节点的上下文中始终是唯一的)。这意味着“重新扫描”逻辑与全字符串匹配逻辑(即在树中搜索子字符串)不同。

4)此处描述的原始后缀链接只是可能的方法之一。例如NJ Larsson等。将该方法命名为“ 面向节点的自上而下”,并将其与“ 面向节点的自下而上”和两个“面向边缘的”方法进行比较。不同的方法具有不同的典型和最坏情况下的性能,要求,限制等,但是通常看来,面向边缘的方法是对原始方法的整体改进。


Pet*_*vaz 6

我的直觉如下:

在主循环的k次迭代之后,您构建了一个后缀树,其中包含以前k个字符开头的完整字符串的所有后缀.

在开始时,这意味着后缀树包含表示整个字符串的单个根节点(这是从0开始的唯一后缀).

在len(字符串)迭代之后,您有一个包含所有后缀的后缀树.

在循环期间,键是活动点.我的猜测是,这代表后缀树中最深的一点,它对应于字符串前k个字符的正确后缀.(我认为正确意味着后缀不能是整个字符串.)

例如,假设您已经看过字符'abcabc'.活动点将表示树中对应于后缀"abc"的点.

活动点由(origin,first,last)表示.这意味着您当前位于树中的点,从节点原点开始,然后以字符串[first:last]中的字符输入

添加新角色时,您会查看活动点是否仍在现有树中.如果是,那么你就完成了.否则,您需要将新节点添加到活动点的后缀树,回退到下一个最短匹配,然后再次检查.

注1:后缀指针为每个节点提供下一个最短匹配的链接.

注意2:添加新节点和回退时,为新节点添加新的后缀指针.此后缀指针的目标将是缩短活动点的节点.此节点将已存在,或者在此回退循环的下一次迭代中创建.

注3:标准化部分只是节省了检查活动点的时间.例如,假设您总是使用origin = 0,并且只是首先和最后一次更改.要检查活动点,每次沿着所有中间节点都必须遵循后缀树.通过仅记录距离最后一个节点的距离来缓存跟踪此路径的结果是有意义的.

你能给出一个"修复"边界变量的代码示例吗?

健康警告:我也发现这个算法特别难以理解,所以请认识到这种直觉可能在所有重要细节中都是错误的......