Nat*_*ley 1065 language-agnostic string algorithm suffix-tree
此时我觉得有点厚.我花了几天时间试图完全用后缀树构建我的头,但由于我没有数学背景,因为他们开始过度使用数学符号系统时,许多解释都没有.最接近我发现的一个很好的解释是使用后缀树进行快速字符串搜索,但是他隐藏了各种点,并且算法的某些方面仍然不清楚.
在Stack Overflow中对此算法的逐步解释对于我以外的许多其他人来说都是非常宝贵的,我敢肯定.
作为参考,这里是Ukkonen关于该算法的论文:http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
到目前为止我的基本理解:
基本算法似乎是O(n 2),正如我们在大多数解释中所指出的那样,因为我们需要遍历所有前缀,然后我们需要逐步遍历每个前缀的每个后缀.由于他使用的后缀指针技术,Ukkonen的算法显然是独一无二的,尽管我认为这是我无法理解的.
我也很难理解:
这是完成的C#源代码.它不仅工作正常,而且支持自动规范化,并提供更好看的输出文本图形.源代码和示例输出位于:
更新2017-11-04
多年以后,我发现了后缀树的新用途,并在JavaScript中实现了该算法.要点如下.它应该没有错误.npm install chalk
从同一位置将其转储到js文件中,然后使用node.js运行以查看一些彩色输出.在同一个Gist中有一个精简版本,没有任何调试代码.
https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6
jog*_*pan 2322
The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (i.e. does not contain any repeated characters), and then extending it to the full algorithm.
First, a few preliminary statements.
What we are building, is basically like a search trie. So there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth
But: Unlike in a search trie, the edge labels are not single
characters. Instead, each edge is labeled using a pair of integers
[from,to]
. These are pointers into the text. In this sense, each
edge carries a string label of arbitrary length, but takes only O(1)
space (two pointers).
I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:
abc
Run Code Online (Sandbox Code Playgroud)
The algorithm works in steps, from left to right. There is one step for every character of the string. Each step might involve more than one individual operation, but we will see (see the final observations at the end) that the total number of operations is O(n).
So, we start from the left, and first insert only the single character
a
by creating an edge from the root node (on the left) to a leaf,
and labeling it as [0,#]
, which means the edge represents the
substring starting at position 0 and ending at the current end. I
use the symbol #
to mean the current end, which is at position 1
(right after a
).
So we have an initial tree, which looks like this:
And what it means is this:
Now we progress to position 2 (right after b
). Our goal at each step
is to insert all suffixes up to the current position. We do this
by
a
-edge to ab
b
In our representation this looks like
它意味着:
我们观察到两件事:
ab
是相同的,因为它使用的是在初始树:[0,#]
.其含义自动更改,因为我们将当前位置#
从1 更新为2.接下来,我们再次递增位置并通过将a附加c
到每个现有边并为新后缀插入一个新边来更新树c
.
在我们的表示中,这看起来像
它意味着:
我们观察到:
#
, and inserting the
one new edge for the final character can be done in O(1)
time. Hence for a string of length n, only O(n) time is required.Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:
abcabxabcd
Run Code Online (Sandbox Code Playgroud)
It starts with abc
as in the previous example, then ab
is repeated
and followed by x
, and then abc
is repeated followed by d
.
Steps 1 through 3: After the first 3 steps we have the tree from the previous example:
Step 4: We move #
to position 4. This implicitly updates all existing
edges to this:
我们需要a
在根目录下插入当前步骤的最后一个后缀.
在我们这样做之前,我们引入了另外两个变量(除此之外
#
),这些变量当然一直存在,但到目前为止我们还没有使用它们:
(active_node,active_edge,active_length)
remainder
,这是表明我们需要多少新的后缀插入一个整数这两者的确切含义很快就会清楚,但现在让我们说:
abc
example, the active point was always
(root,'\0x',0)
, i.e. active_node
was the root node, active_edge
was specified as the null character '\0x'
, and active_length
was zero. The effect of this was that the one new edge that
we inserted in every step was inserted at the root node as a
freshly created edge. We will see soon why a triple is necessary to
represent this information.remainder
was always set to 1 at the beginning of each
step. The meaning of this was that the number of suffixes we had to
actively insert at the end of each step was 1 (always just the
final character).Now this is going to change. When we insert the current final
character a
at the root, we notice that there is already an outgoing
edge starting with a
, specifically: abca
. Here is what we do in
such a case:
[4,#]
at the root node. Instead we
simply notice that the suffix a
is already in our
tree. It ends in the middle of a longer edge, but we are not
bothered by that. We just leave things the way they are.(root,'a',1)
. That means the active
point is now somewhere in the middle of outgoing edge of the root node that starts with a
, specifically, after position 1 on that edge. We
notice that the edge is specified simply by its first
character a
. That suffices because there can be only one edge
starting with any particular character (confirm that this is true after reading through the entire description).remainder
, so at the beginning of the next step
it will be 2.Observation: When the final suffix we need to insert is found to
exist in the tree already, the tree itself is not changed at all (we only update the active point and remainder
). The tree
is then not an accurate representation of the suffix tree up to the
current position any more, but it contains all suffixes (because the final
suffix a
is contained implicitly). Hence, apart from updating the
variables (which are all of fixed length, so this is O(1)), there was
no work done in this step.
Step 5: We update the current position #
to 5. This
automatically updates the tree to this:
And because remainder
is 2, we need to insert two final
suffixes of the current position: ab
and b
. This is basically because:
a
suffix from the previous step has never been properly
inserted. So it has remained, and since we have progressed one
step, it has now grown from a
to ab
.b
.In practice this means that we go to the active point (which points to
behind the a
on what is now the abcab
edge), and insert the
current final character b
. But: Again, it turns out that b
is
also already present on that same edge.
So, again, we do not change the tree. We simply:
(root,'a',2)
(same node and edge
as before, but now we point to behind the b
)remainder
to 3 because we still have not properly
inserted the final edge from the previous step, and we don't insert
the current final edge either.To be clear: We had to insert ab
and b
in the current step, but
because ab
was already found, we updated the active point and did
not even attempt to insert b
. Why? Because if ab
is in the tree,
every suffix of it (including b
) must be in the tree,
too. Perhaps only implicitly, but it must be there, because of the
way we have built the tree so far.
We proceed to step 6 by incrementing #
. The tree is
automatically updated to:
Because remainder
is 3, we have to insert abx
, bx
and
x
. The active point tells us where ab
ends, so we only need to
jump there and insert the x
. Indeed, x
is not there yet, so we
split the abcabx
edge and insert an internal node:
The edge representations are still pointers into the text, so splitting and inserting an internal node can be done in O(1) time.
So we have dealt with abx
and decrement remainder
to 2. Now we
need to insert the next remaining suffix, bx
. But before we do that
we need to update the active point. The rule for this, after splitting
and inserting an edge, will be called Rule 1 below, and it applies whenever the
active_node
is root (we will learn rule 3 for other cases further
below). Here is rule 1:
After an insertion from root,
active_node
remains rootactive_edge
is set to the first character of the new suffix we need to insert, i.e.b
active_length
is reduced by 1
Hence, the new active-point triple (root,'b',1)
indicates that the
next insert has to be made at the bcabx
edge, behind 1 character,
i.e. behind b
. We can identify the insertion point in O(1) time and
check whether x
is already present or not. If it was present, we
would end the current step and leave everything the way it is. But x
is not present, so we insert it by splitting the edge:
Again, this took O(1) time and we update remainder
to 1 and the
active point to (root,'x',0)
as rule 1 states.
But there is one more thing we need to do. We'll call this Rule 2:
If we split an edge and insert a new node, and if that is not the first node created during the current step, we connect the previously inserted node and the new node through a special pointer, a suffix link. We will later see why that is useful. Here is what we get, the suffix link is represented as a dotted edge:
We still need to insert the final suffix of the current step,
x
. Since the active_length
component of the active node has fallen
to 0, the final insert is made at the root directly. Since there is no
outgoing edge at the root node starting with x
, we insert a new
edge:
As we can see, in the current step all remaining inserts were made.
We proceed to step 7 by setting #
=7, which automatically appends the next character,
a
, to all leaf edges, as always. Then we attempt to insert the new final
character to the active point (the root), and find that it is there
already. So we end the current step without inserting anything and
update the active point to (root,'a',1)
.
In step 8, #
=8, we append b
, and as seen before, this only
means we update the active point to (root,'a',2)
and increment remainder
without doing
anything else, because b
is already present. However, we notice (in O(1) time) that the active point
is now at the end of an edge. We reflect this by re-setting it to
(node1,'\0x',0)
. Here, I use node1
to refer to the
internal node the ab
edge ends at.
Then, in step #
=9, we need to insert 'c' and this will help us to
understand the final trick:
As always, the #
update appends c
automatically to the leaf edges
and we go to the active point to see if we can insert 'c'. It turns
out 'c' exists already at that edge, so we set the active point to
(node1,'c',1)
, increment remainder
and do nothing else.
Now in step #
=10, remainder
, and so we first need to insert
abcd
(which remains from 3 steps ago) by inserting d
at the active
point.
Attempting to insert d
at the active point causes an edge split in
O(1) time:
The active_node
, from which the split was initiated, is marked in
red above. Here is the final rule, Rule 3:
After splitting an edge from an
active_node
that is not the root node, we follow the suffix link going out of that node, if there is any, and reset theactive_node
to the node it points to. If there is no suffix link, we set theactive_node
to the root.active_edge
andactive_length
remain unchanged.
So the active point is now (node2,'c',1)
, and node2
is marked in
red below:
Since the insertion of abcd
is complete, we decrement remainder
to
3 and consider the next remaining suffix of the current step,
bcd
. Rule 3 has set the active point to just the right node and edge
so inserting bcd
can be done by simply inserting its final character
d
at the active point.
Doing this causes another edge split, and because of rule 2, we must create a suffix link from the previously inserted node to the new one:
We observe: Suffix links enable us to reset the active point so we
can make the next remaining insert at O(1) effort. Look at the
graph above to confirm that indeed node at label ab
is linked to
the node at b
(its suffix), and the node at abc
is linked to
bc
.
The current step is not finished yet. remainder
is now 2, and we
need to follow rule 3 to reset the active point again. Since the
current active_node
(red above) has no suffix link, we reset to
root. The active point is now (root,'c',1)
.
Hence the next insert occurs at the one outgoing edge of the root node
whose label starts with c
: cabxabcd
, behind the first character,
i.e. behind c
. This causes another split:
And since this involves the creation of a new internal node,we follow rule 2 and set a new suffix link from the previously created internal node:
(I am using Graphviz Dot for these little graphs. The new suffix link caused dot to re-arrange the existing edges, so check carefully to confirm that the only thing that was inserted above is a new suffix link.)
With this, remainder
can be set to 1 and since the active_node
is
root, we use rule 1 to update the active point to (root,'d',0)
. This
means the final insert of the current step is to insert a single d
at root:
That was the final step and we are done. There are number of final observations, though:
In each step we move #
forward by 1 position. This automatically
updates all leaf nodes in O(1) time.
But it does not deal with a) any suffixes remaining from previous steps, and b) with the one final character of the current step.
remainder
tells us how many additional inserts we need to
make. These inserts correspond one-to-one to the final suffixes of
the string that ends at the current position #
. We consider one
after the other and make the insert. Important: Each insert is
done in O(1) time since the active point tells us exactly where to
go, and we need to add only one single character at the active
point. Why? Because the other characters are contained implicitly
(otherwise the active point would not be where it is).
After each such insert, we decrement remainder
and follow the
suffix link if there is any. If not we go to root (rule 3). If we
are at root already, we modify the active point using rule 1. In
any case, it takes only O(1) time.
If, during one of these inserts, we find that the character we want
to insert is already there, we don't do anything and end the
current step, even if remainder
>0. The reason is that any
inserts that remain will be suffixes of the one we just tried to
make. Hence they are all implicit in the current tree. The fact
that remainder
>0 makes sure we deal with the remaining suffixes
later.
What if at the end of the algorithm remainder
>0? This will be the
case whenever the end of the text is a substring that occurred
somewhere before. In that case we must append one extra character
at the end of the string that has not occurred before. In the
literature, usually the dollar sign $
is used as a symbol for
that. Why does that matter? --> If later we use the completed suffix tree to search for suffixes, we must accept matches only if they end at a leaf. Otherwise we would get a lot of spurious matches, because there are many strings implicitly contained in the tree that are not actual suffixes of the main string. Forcing remainder
to be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. Howe
mak*_*nov 128
我尝试使用jogojapan的答案中给出的方法来实现后缀树,但由于用于规则的措辞,它在某些情况下不起作用.此外,我已经提到没有人设法使用这种方法实现一个绝对正确的后缀树.下面我将写一个jogojapan答案的"概述",并对规则进行一些修改.我还将描述忘记创建重要后缀链接的情况.
使用其他变量
让我们使用内部节点的概念- 除了根和叶子之外的所有节点都是内部节点.
观察1
当我们发现需要插入的最后一个后缀已经存在于树中时,树本身根本没有改变(我们只更新了active point
和remainder
).
观察2
如果在某一点active_length
上大于或等于当前edge(edge_length
)的长度,我们active point
向下移动直到edge_length
严格大于active_length
.
现在,让我们重新定义规则:
规则1
如果从活动节点 = root插入后,活动长度大于0,则:
- 活动节点未更改
- 活动长度递减
- 活动边缘向右移动(到下一个后缀的第一个字符,我们必须插入)
规则2
如果我们创建了一个新的内部节点 或从做出插入内部节点,这是不是第一次此类 内部节点在当前步骤,那么我们链路中的前SUCH与节点THIS通过一个后缀链接.
这个定义Rule 2
与jogojapan'不同,因为我们不仅考虑了新创建的内部节点,还考虑了我们进行插入的内部节点.
规则3
从插入后活动节点是不是根节点,我们必须按照后缀链路和设置主动节点所指向的节点.如果不存在一个链路后缀,设置活动节点到根节点.无论哪种方式,有效边沿和有效长度保持不变.
在这个定义中Rule 3
我们还考虑了叶节点的插入(不仅是分裂节点).
最后,观察3:
当我们想要添加到树的符号已经在边缘时,我们根据Observation 1
,仅更新active point
并remainder
保持树不变.但是如果有一个标记为需要后缀链接的内部节点,我们必须通过后缀链接将该节点与我们的当前节点连接起来.active node
如果我们在这种情况下添加后缀链接,如果我们不这样做,那么让我们看看cdddcdc的后缀树示例:
如果我们不通过后缀链接连接节点:
如果我们DO通过后缀链接连接的节点:
似乎没有显着差异:在第二种情况下,还有两个后缀链接.但是这些后缀链接是正确的,其中一个 - 从蓝色节点到红色节点 - 对于我们的主动点方法非常重要.问题是,如果我们不在这里添加后缀链接,稍后,当我们向树中添加一些新字母时,我们可能会省略一些节点添加到树中,因为,根据它,如果没有后缀链接,那么我们必须把根目录.Rule 3
active_node
当我们将最后一个字母添加到树中时,在我们从蓝色节点(边缘标记为'c')插入之前,红色节点已经存在.由于蓝色节点有插入,我们将其标记为需要后缀链接.然后,依靠活动点方法,将其设置为红色节点.但是我们不从红色节点插入,因为字母'c'已经在边缘.这是否意味着蓝色节点必须没有后缀链接?不,我们必须通过后缀链接将蓝色节点与红色节点连接起来.为什么这是正确的?因为主动点方法保证我们到达正确的位置,即下一个我们必须处理较短后缀插入的地方.active node
最后,这是我对后缀树的实现:
希望这个"概述"结合jogojapan的详细答案将有助于某人实现自己的后缀树.
感谢@jogojapan精心解释的教程,我在Python中实现了算法.
@jogojapan提到的一些小问题比我预期的要复杂得多,需要非常小心对待.我花了几天的时间才能让我的实现足够强大(我想).问题和解决方案如下:
最后Remainder > 0
结果事实证明,在展开步骤中也可能发生这种情况,而不仅仅是整个算法的结束.当发生这种情况时,我们可以保持余数,actnode,actedge和actlength 不变,结束当前的展开步骤,并通过保持折叠或展开来开始另一步,具体取决于原始字符串中的下一个字符是否在当前路径上或不.
跳过节点:当我们遵循后缀链接时,更新活动点,然后发现其active_length组件与新的active_node不兼容.我们必须向前移动到合适的位置裂开,或插入叶.这个过程可能是不那么简单,因为移动actlength期间actedge不断变化一路,当你不得不搬回到根节点的actedge和actlength可能是错误的,因为这些举措的.我们需要额外的变量来保存这些信息.
@managonov以某种方式指出了另外两个问题
拆分可能会退化当尝试拆分边时,有时您会发现拆分操作在节点上是正确的.在这种情况下,我们只需要向该节点添加一个新的叶子,将其作为标准的边缘分割操作,这意味着后缀链接(如果有的话)应该相应地进行维护.
隐藏的后缀链接问题1和问题2引起了另一个特殊情况.有时我们需要跳过几个节点到正确的点进行拆分,如果我们通过比较余数字符串和路径标签来移动,我们可能会超过正确的点.那种情况下,后缀链接将被无意中忽略,如果有的话.通过在前进时记住正确的点可以避免这种情况.如果拆分节点已存在,则应保留后缀链接,或者甚至在展开步骤期间发生问题1.
最后,我在Python中的实现如下:
提示: 它包含上面代码中的天真树打印功能,这在调试时非常重要.它为我节省了大量时间,便于查找特殊情况.
@jogojapan,您带来了很棒的解释和可视化。但是正如@makagonov所说,它缺少一些有关设置后缀链接的规则。通过“ aabaaabb”一词逐步浏览http://brenden.github.io/ukkonen-animation/时,可以很好地看到它。当您从步骤10转到步骤11时,没有从节点5到节点2的后缀链接,但是活动点突然在那里移动。
@makagonov,因为我生活在Java世界中,所以我也尝试遵循您的实现来掌握ST构建工作流程,但由于以下原因,我感到很难:
因此,我最终用Java实现了这种实现,希望以更清晰的方式反映所有步骤,并减少其他Java人员的学习时间:
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
public class ST {
public class Node {
private final int id;
private final Map<Character, Edge> edges;
private Node slink;
public Node(final int id) {
this.id = id;
this.edges = new HashMap<>();
}
public void setSlink(final Node slink) {
this.slink = slink;
}
public Map<Character, Edge> getEdges() {
return this.edges;
}
public Node getSlink() {
return this.slink;
}
public String toString(final String word) {
return new StringBuilder()
.append("{")
.append("\"id\"")
.append(":")
.append(this.id)
.append(",")
.append("\"slink\"")
.append(":")
.append(this.slink != null ? this.slink.id : null)
.append(",")
.append("\"edges\"")
.append(":")
.append(edgesToString(word))
.append("}")
.toString();
}
private StringBuilder edgesToString(final String word) {
final StringBuilder edgesStringBuilder = new StringBuilder();
edgesStringBuilder.append("{");
for(final Map.Entry<Character, Edge> entry : this.edges.entrySet()) {
edgesStringBuilder.append("\"")
.append(entry.getKey())
.append("\"")
.append(":")
.append(entry.getValue().toString(word))
.append(",");
}
if(!this.edges.isEmpty()) {
edgesStringBuilder.deleteCharAt(edgesStringBuilder.length() - 1);
}
edgesStringBuilder.append("}");
return edgesStringBuilder;
}
public boolean contains(final String word, final String suffix) {
return !suffix.isEmpty()
&& this.edges.containsKey(suffix.charAt(0))
&& this.edges.get(suffix.charAt(0)).contains(word, suffix);
}
}
public class Edge {
private final int from;
private final int to;
private final Node next;
public Edge(final int from, final int to, final Node next) {
this.from = from;
this.to = to;
this.next = next;
}
public int getFrom() {
return this.from;
}
public int getTo() {
return this.to;
}
public Node getNext() {
return this.next;
}
public int getLength() {
return this.to - this.from;
}
public String toString(final String word) {
return new StringBuilder()
.append("{")
.append("\"content\"")
.append(":")
.append("\"")
.append(word.substring(this.from, this.to))
.append("\"")
.append(",")
.append("\"next\"")
.append(":")
.append(this.next != null ? this.next.toString(word) : null)
.append("}")
.toString();
}
public boolean contains(final String word, final String suffix) {
if(this.next == null) {
return word.substring(this.from, this.to).equals(suffix);
}
return suffix.startsWith(word.substring(this.from,
this.to)) && this.next.contains(word, suffix.substring(this.to - this.from));
}
}
public class ActivePoint {
private final Node activeNode;
private final Character activeEdgeFirstCharacter;
private final int activeLength;
public ActivePoint(final Node activeNode,
final Character activeEdgeFirstCharacter,
final int activeLength) {
this.activeNode = activeNode;
this.activeEdgeFirstCharacter = activeEdgeFirstCharacter;
this.activeLength = activeLength;
}
private Edge getActiveEdge() {
return this.activeNode.getEdges().get(this.activeEdgeFirstCharacter);
}
public boolean pointsToActiveNode() {
return this.activeLength == 0;
}
public boolean activeNodeIs(final Node node) {
return this.activeNode == node;
}
public boolean activeNodeHasEdgeStartingWith(final char character) {
return this.activeNode.getEdges().containsKey(character);
}
public boolean activeNodeHasSlink() {
return this.activeNode.getSlink() != null;
}
public boolean pointsToOnActiveEdge(final String word, final char character) {
return word.charAt(this.getActiveEdge().getFrom() + this.activeLength) == character;
}
public boolean pointsToTheEndOfActiveEdge() {
return this.getActiveEdge().getLength() == this.activeLength;
}
public boolean pointsAfterTheEndOfActiveEdge() {
return this.getActiveEdge().getLength() < this.activeLength;
}
public ActivePoint moveToEdgeStartingWithAndByOne(final char character) {
return new ActivePoint(this.activeNode, character, 1);
}
public ActivePoint moveToNextNodeOfActiveEdge() {
return new ActivePoint(this.getActiveEdge().getNext(), null, 0);
}
public ActivePoint moveToSlink() {
return new ActivePoint(this.activeNode.getSlink(),
this.activeEdgeFirstCharacter,
this.activeLength);
}
public ActivePoint moveTo(final Node node) {
return new ActivePoint(node, this.activeEdgeFirstCharacter, this.activeLength);
}
public ActivePoint moveByOneCharacter() {
return new ActivePoint(this.activeNode,
this.activeEdgeFirstCharacter,
this.activeLength + 1);
}
public ActivePoint moveToEdgeStartingWithAndByActiveLengthMinusOne(final Node node,
final char character) {
return new ActivePoint(node, character, this.activeLength - 1);
}
public ActivePoint moveToNextNodeOfActiveEdge(final String word, final int index) {
return new ActivePoint(this.getActiveEdge().getNext(),
word.charAt(index - this.activeLength + this.getActiveEdge().getLength()),
this.activeLength - this.getActiveEdge().getLength());
}
public void addEdgeToActiveNode(final char character, final Edge edge) {
this.activeNode.getEdges().put(character, edge);
}
public void splitActiveEdge(final String word,
final Node nodeToAdd,
final int index,
final char character) {
final Edge activeEdgeToSplit = this.getActiveEdge();
final Edge splittedEdge = new Edge(activeEdgeToSplit.getFrom(),
activeEdgeToSplit.getFrom() + this.activeLength,
nodeToAdd);
nodeToAdd.getEdges().put(word.charAt(activeEdgeToSplit.getFrom() + this.activeLength),
new Edge(activeEdgeToSplit.getFrom() + this.activeLength,
activeEdgeToSplit.getTo(),
activeEdgeToSplit.getNext()));
nodeToAdd.getEdges().put(character, new Edge(index, word.length(), null));
this.activeNode.getEdges().put(this.activeEdgeFirstCharacter, splittedEdge);
}
public Node setSlinkTo(final Node previouslyAddedNodeOrAddedEdgeNode,
final Node node) {
if(previouslyAddedNodeOrAddedEdgeNode != null) {
previouslyAddedNodeOrAddedEdgeNode.setSlink(node);
}
return node;
}
public Node setSlinkToActiveNode(final Node previouslyAddedNodeOrAddedEdgeNode) {
return setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, this.activeNode);
}
}
private static int idGenerator;
private final String word;
private final Node root;
private ActivePoint activePoint;
private int remainder;
public ST(final String word) {
this.word = word;
this.root = new Node(idGenerator++);
this.activePoint = new ActivePoint(this.root, null, 0);
this.remainder = 0;
build();
}
private void build() {
for(int i = 0; i < this.word.length(); i++) {
add(i, this.word.charAt(i));
}
}
private void add(final int index, final char character) {
this.remainder++;
boolean characterFoundInTheTree = false;
Node previouslyAddedNodeOrAddedEdgeNode = null;
while(!characterFoundInTheTree && this.remainder > 0) {
if(this.activePoint.pointsToActiveNode()) {
if(this.activePoint.activeNodeHasEdgeStartingWith(character)) {
activeNodeHasEdgeStartingWithCharacter(character, previouslyAddedNodeOrAddedEdgeNode);
characterFoundInTheTree = true;
}
else {
if(this.activePoint.activeNodeIs(this.root)) {
rootNodeHasNotEdgeStartingWithCharacter(index, character);
}
else {
previouslyAddedNodeOrAddedEdgeNode = internalNodeHasNotEdgeStartingWithCharacter(index,
character, previouslyAddedNodeOrAddedEdgeNode);
}
}
}
else {
if(this.activePoint.pointsToOnActiveEdge(this.word, character)) {
activeEdgeHasCharacter();
characterFoundInTheTree = true;
}
else {
if(this.activePoint.activeNodeIs(this.root)) {
previouslyAddedNodeOrAddedEdgeNode = edgeFromRootNodeHasNotCharacter(index,
character,
previouslyAddedNodeOrAddedEdgeNode);
}
else {
previouslyAddedNodeOrAddedEdgeNode = edgeFromInternalNodeHasNotCharacter(index,
character,
previouslyAddedNodeOrAddedEdgeNode);
}
}
}
}
}
private void activeNodeHasEdgeStartingWithCharacter(final char character,
final Node previouslyAddedNodeOrAddedEdgeNode) {
this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
this.activePoint = this.activePoint.moveToEdgeStartingWithAndByOne(character);
if(this.activePoint.pointsToTheEndOfActiveEdge()) {
this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
}
}
private void rootNodeHasNotEdgeStartingWithCharacter(final int index, final char character) {
this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
this.activePoint = this.activePoint.moveTo(this.root);
this.remainder--;
assert this.remainder == 0;
}
private Node internalNodeHasNotEdgeStartingWithCharacter(final int index,
final char character,
Node previouslyAddedNodeOrAddedEdgeNode) {
this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
if(this.activePoint.activeNodeHasSlink()) {
this.activePoint = this.activePoint.moveToSlink();
}
else {
this.activePoint = this.activePoint.moveTo(this.root);
}
this.remainder--;
return previouslyAddedNodeOrAddedEdgeNode;
}
private void activeEdgeHasCharacter() {
this.activePoint = this.activePoint.moveByOneCharacter();
if(this.activePoint.pointsToTheEndOfActiveEdge()) {
this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
}
}
private Node edgeFromRootNodeHasNotCharacter(final int index,
final char character,
Node previouslyAddedNodeOrAddedEdgeNode) {
final Node newNode = new Node(idGenerator++);
this.activePoint.splitActiveEdge(this.word, newNode, index, character);
previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
this.activePoint = this.activePoint.moveToEdgeStartingWithAndByActiveLengthMinusOne(this.root,
this.word.charAt(index - this.remainder + 2));
this.activePoint = walkDown(index);
this.remainder--;
return previouslyAddedNodeOrAddedEdgeNode;
}
private Node edgeFromInternalNodeHasNotCharacter(final int index,
final char character,
Node previouslyAddedNodeOrAddedEdgeNode) {
final Node newNode = new Node(idGenerator++);
this.activePoint.splitActiveEdge(this.word, newNode, index, character);
previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
if(this.activePoint.activeNodeHasSlink()) {
this.activePoint = this.activePoint.moveToSlink();
}
else {
this.activePoint = this.activePoint.moveTo(this.root);
}
this.activePoint = walkDown(index);
this.remainder--;
return previouslyAddedNodeOrAddedEdgeNode;
}
private ActivePoint walkDown(final int index) {
while(!this.activePoint.pointsToActiveNode()
&& (this.activePoint.pointsToTheEndOfActiveEdge() || this.activePoint.pointsAfterTheEndOfActiveEdge())) {
if(this.activePoint.pointsAfterTheEndOfActiveEdge()) {
this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge(this.word, index);
}
else {
this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
}
}
return this.activePoint;
}
public String toString(final String word) {
return this.root.toString(word);
}
public boolean contains(final String suffix) {
return this.root.contains(this.word, suffix);
}
public static void main(final String[] args) {
final String[] words = {
"abcabcabc$",
"abc$",
"abcabxabcd$",
"abcabxabda$",
"abcabxad$",
"aabaaabb$",
"aababcabcd$",
"ababcabcd$",
"abccba$",
"mississipi$",
"abacabadabacabae$",
"abcabcd$",
"00132220$"
};
Arrays.stream(words).forEach(word -> {
System.out.println("Building suffix tree for word: " + word);
final ST suffixTree = new ST(word);
System.out.println("Suffix tree: " + suffixTree.toString(word));
for(int i = 0; i < word.length() - 1; i++) {
assert suffixTree.contains(word.substring(i)) : word.substring(i);
}
});
}
}
Run Code Online (Sandbox Code Playgroud)
道歉,如果我的回答似乎多余,但我最近实施了Ukkonen的算法,发现自己已经为之奋斗了好几天。我必须通读有关该主题的多篇论文,以了解该算法某些核心方面的原因和方式。
我发现先前答案的“规则”方法无助于理解其根本原因,因此,我在下文中仅着重于语用学方面的内容。如果您像我一样努力遵循其他说明,也许我的补充说明会为您“点击”。
我在这里发布了C#实现:https://github.com/baratgabor/SuffixTree
请注意,我不是该主题的专家,因此以下各节可能包含错误(或更糟)。如果遇到任何问题,请随时进行编辑。
以下说明的起点假定您熟悉后缀树的内容和用法,以及Ukkonen算法的特征,例如,如何从头到尾逐个字符地扩展后缀树。基本上,我假设您已经阅读了其他一些说明。
(但是,我确实必须为流程添加一些基本的叙述,因此开始时确实可能感觉很多余。)
最有趣的部分是对使用后缀链接和从根目录重新扫描之间的区别的解释。这就是我在实施过程中遇到的许多错误和头痛的原因。
我确定您已经知道最基本的“技巧”是意识到我们可以只保留后缀“ open”的结尾,即引用字符串的当前长度,而不是将结尾设置为静态值。这样,当我们添加其他字符时,这些字符将隐式添加到所有后缀标签中,而无需访问和更新所有后缀。
但是,由于明显的原因,后缀的这种开放结尾仅适用于表示字符串结尾的节点,即树结构中的叶节点。我们在树上执行的分支操作(添加新的分支节点和叶节点)不会自动传播到所需的任何地方。
重复的子字符串不会显式地出现在树中,这很可能是基本的,因此无需赘述,因为由于树是重复的,因此树已经包含了这些子字符串。但是,当重复子字符串由于遇到非重复字符而结束时,我们需要在该点创建一个分支以表示从该点开始的分支。
例如,对于字符串“ ABCXABCY”(请参见下文),需要将X和Y的分支添加到三个不同的后缀ABC,BC和C中;否则,它将不是有效的后缀树,并且通过从根向下匹配字符,我们无法找到字符串的所有子字符串。
再次强调一下– 我们对树中后缀执行的任何操作也必须由其连续后缀(例如,ABC> BC> C)反映出来,否则它们不再是有效的后缀。
但是,即使我们接受必须进行这些手动更新,我们如何知道需要更新多少个后缀?因为,当我们添加重复字符A(以及连续的其余重复字符)时,我们还不知道何时/何地需要将后缀分成两个分支。仅当我们遇到第一个非重复字符时才确定需要拆分,在这种情况下为Y(而不是树中已经存在的X)。
我们可以做的是匹配最长的重复字符串,并计算以后需要更新的后缀个数。这就是“剩余”的意思。
变量remainder
告诉我们隐式添加了多少个重复字符,没有分支;也就是说,一旦发现无法匹配的第一个字符,我们需要访问多少个后缀以重复分支操作。这实质上等于从树的根开始我们在树中有多少个“深”字符。
因此,与字符串ABCXABCY的前面的示例相同,我们“隐式” 匹配重复的ABC部分,remainder
每次递增,结果为3的余数。然后遇到非重复字符“ Y”。在这里,我们拆了以前添加ABCX到ABC - > X和ABC - > ÿ。然后我们remainder
从3 递减到2,因为我们已经处理了ABC分支。现在我们通过从根开始匹配最后两个字符BC来重复操作,直到需要分割的点,然后我们也将BCX分割为BC- > X和BC - > ÿ。同样,我们递减remainder
为1,然后重复该操作;直到the remainder
为0。最后,我们还需要将当前字符(Y)本身也添加到根中。
该操作从根开始连续跟随后缀,直到我们需要进行操作的点,这在Ukkonen的算法中称为“重新扫描”,通常这是算法中最昂贵的部分。想象一下一个更长的字符串,其中您需要跨数十个节点(我们将在后面讨论),可能需要数千次“重新扫描”长子字符串。
作为解决方案,我们介绍了所谓的“后缀链接”。
后缀链接基本上指向我们通常必须“重新扫描”的位置,因此,代替昂贵的重新扫描操作,我们可以简单地跳至链接位置,进行工作,跳至下一个链接位置,然后重复–直到出现没有更多职位可更新。
当然,一个大问题是如何添加这些链接。现有的答案是,我们可以在插入新的分支节点时添加链接,这利用了以下事实:在树的每个扩展中,自然需要按照确切的顺序一个接一个地创建分支节点,我们需要将它们链接在一起。虽然,我们必须从最后创建的分支节点(最长的后缀)链接到先前创建的分支节点,所以我们需要缓存我们创建的最后一个分支节点,将其链接到我们创建的下一个分支节点,并缓存新创建的分支节点。
结果是实际上我们通常没有后缀链接,因为给定的分支节点是刚刚创建的。在这些情况下,我们必须从根本上退回到前述的“重新扫描”。这就是为什么在插入后会提示您使用后缀链接或跳转到根目录的原因。
(或者,如果您将父指针存储在节点中,则可以尝试跟随父节点,检查它们是否具有链接,并使用该链接。我发现很少提及该链接,但是后缀链接用法并未提及在石集,有多种可能的方法,如果你了解底层机制可以实现一个适合您的需求是最好的。)
到目前为止,我们讨论了用于构建树的多种有效工具,并且模糊地涉及遍历多个边缘和节点,但是尚未探讨相应的结果和复杂性。
前面解释的“剩余”概念对于跟踪我们在树中的位置很有用,但我们必须意识到它没有存储足够的信息。
首先,我们总是驻留在节点的特定边缘上,因此我们需要存储边缘信息。我们称其为“主动边缘”。
其次,即使添加了边缘信息后,我们仍然没有办法确定一个位置,在树越往下,而不是直接连接到根节点。因此,我们还需要存储该节点。我们将此称为“活动节点”。
最后,我们可以注意到,“余数”不足以标识未直接连接到根的边上的位置,因为“余数”是整条路线的长度。而且我们可能不想打扰记住和减去前一条边的长度。因此,我们需要一种表示形式,基本上是当前边缘上的其余部分。这就是我们所说的“活动长度”。
这导致了我们所谓的“活动点” –三个变量的包,其中包含我们需要维护的有关树中位置的所有信息:
Active Point = (Active Node, Active Edge, Active Length)
您可以在下图上观察到,ABCABD的匹配路由是如何由边缘AB上的2个字符组成的(来自root),以及边缘CABDABCABD上的4个字符组成的(来自节点4)–导致“剩余”为6个字符。因此,我们当前的位置可以标识为活动节点4,活动边缘C,活动长度4。
“活动点”的另一个重要作用是为我们的算法提供了一个抽象层,这意味着我们算法的各个部分都可以在“活动点”上进行工作,而不管该活动点是在根还是在其他任何地方。这使得在我们的算法中以简洁明了的方式轻松实现后缀链接的使用。
现在,棘手的部分(根据我的经验)可能会导致大量的错误和头痛,并且在大多数来源中都没有很好地解释,这是处理后缀链接案例与重新扫描案例的区别。
考虑以下字符串'AAAABAAAABAAC'的示例:
您可以在上面观察到“余数” 7 如何对应于来自根的字符总和,而“活动长度” 4如何对应于来自活动节点的活动边缘的匹配字符之和。
现在,在活动点执行分支操作之后,活动节点可能包含后缀链接,也可能不包含后缀链接。
如果存在后缀链接:我们只需要处理“活动长度”部分。该“其余部分,是无关紧要的,因为在这里我们通过后缀链接跳转到该节点已编码正确的“余”含蓄,只是凭借在树枝上,它是。
如果不存在后缀链接:我们需要从零/根开始“重新扫描”,这意味着从头开始处理整个后缀。为此,我们必须使用整个“剩余”作为重新扫描的基础。
考虑上面的示例的下一步会发生什么。让我们比较一下如何获得相同的结果(即移至下一个要处理的后缀)(带或不带后缀链接)。
请注意,如果我们使用后缀链接,我们将自动位于“正确的位置”。由于“有效长度”可能与新职位“不兼容”,因此通常并非严格如此。
在上述情况下,由于“有效长度”为4,所以我们从链接的节点4开始使用后缀“ ABAA”。但是在找到与后缀的第一个字符('A'),我们注意到我们的“有效长度”在此边沿溢出了3个字符。因此,我们跳过整个边缘,移至下一个节点,并根据跳转所消耗的字符来减少“活动长度”。
然后,在找到与后缀'BAA ' 相对应的下一个边缘'B'之后,我们最终注意到边缘长度大于剩余的“有效长度” 3,这意味着我们找到了正确的位置。
请注意,似乎此操作通常不被称为“重新扫描”,即使在我看来,这与重新扫描直接等效,只是缩短了长度且没有根目录起始点。
请注意,如果我们使用传统的“重新扫描”操作(在这里假装没有后缀链接),那么我们将从树的顶部开始,从根开始,然后我们必须再次向下移动到正确的位置,沿当前后缀的整个长度。
此后缀的长度是我们前面讨论的“余数”。我们必须消耗掉剩余的全部,直到达到零。这可能(并且经常如此)包括跳过多个节点,每次跳过都会使剩余部分减少我们跳过的边的长度。最后,我们到达的边缘比剩余的“余数”更长;在这里,我们将有效边设置为给定的边,将“有效长度”设置为剩余的“剩余 ”,就可以了。
但是请注意,实际的“ remainder”变量需要保留,并且仅在每次插入节点后才递减。因此,我上面描述的假设使用的是初始化为'remainder'的单独变量。
关于后缀链接和重新扫描的注意事项
1)请注意,两种方法均会导致相同的结果。但是,在大多数情况下,后缀链接跳转明显更快。这就是后缀链接的全部原理。
2)实际的算法实现无需区别。如上所述,即使在使用后缀链接的情况下,“有效长度”也常常与链接位置不兼容,因为树的该分支可能包含其他分支。因此,从本质上讲,您只需要使用“有效长度”而不是“剩余长度”,并执行相同的重新扫描逻辑,直到找到比剩余后缀长度短的边即可。
3)关于性能的一个重要说明是,在重新扫描期间无需检查每个字符。由于有效的后缀树的构建方式,我们可以安全地假设字符匹配。因此,您主要是在计算长度,并且当我们跳到新的边缘时,唯一需要进行字符等效性检查,因为边缘由其第一个字符标识(在给定节点的上下文中始终是唯一的)。这意味着“重新扫描”逻辑与全字符串匹配逻辑(即在树中搜索子字符串)不同。
4)此处描述的原始后缀链接只是可能的方法之一。例如NJ Larsson等。将该方法命名为“ 面向节点的自上而下”,并将其与“ 面向节点的自下而上”和两个“面向边缘的”方法进行比较。不同的方法具有不同的典型和最坏情况下的性能,要求,限制等,但是通常看来,面向边缘的方法是对原始方法的整体改进。
我的直觉如下:
在主循环的k次迭代之后,您构建了一个后缀树,其中包含以前k个字符开头的完整字符串的所有后缀.
在开始时,这意味着后缀树包含表示整个字符串的单个根节点(这是从0开始的唯一后缀).
在len(字符串)迭代之后,您有一个包含所有后缀的后缀树.
在循环期间,键是活动点.我的猜测是,这代表后缀树中最深的一点,它对应于字符串前k个字符的正确后缀.(我认为正确意味着后缀不能是整个字符串.)
例如,假设您已经看过字符'abcabc'.活动点将表示树中对应于后缀"abc"的点.
活动点由(origin,first,last)表示.这意味着您当前位于树中的点,从节点原点开始,然后以字符串[first:last]中的字符输入
添加新角色时,您会查看活动点是否仍在现有树中.如果是,那么你就完成了.否则,您需要将新节点添加到活动点的后缀树,回退到下一个最短匹配,然后再次检查.
注1:后缀指针为每个节点提供下一个最短匹配的链接.
注意2:添加新节点和回退时,为新节点添加新的后缀指针.此后缀指针的目标将是缩短活动点的节点.此节点将已存在,或者在此回退循环的下一次迭代中创建.
注3:标准化部分只是节省了检查活动点的时间.例如,假设您总是使用origin = 0,并且只是首先和最后一次更改.要检查活动点,每次沿着所有中间节点都必须遵循后缀树.通过仅记录距离最后一个节点的距离来缓存跟踪此路径的结果是有意义的.
你能给出一个"修复"边界变量的代码示例吗?
健康警告:我也发现这个算法特别难以理解,所以请认识到这种直觉可能在所有重要细节中都是错误的......