aho corasick算法的状态转换表

Sha*_*_42 2 algorithm pattern-matching aho-corasick

请帮助我理解Aho-Corasick算法中多个模式的状态转换表的构造.

请给出一个简单而详细的解释,以便我理解.

我正在关注这篇论文,是动画.

谢谢.

mok*_*mok 9

阶段1

创建关键字树:

Starting at the root, follow the path labeled by chars of Pi
        If the path ends before Pi, continue it by adding new edges and ...
                            nodes for the remaining characters of P
        Store identifier i of Pi at the terminal node of the path

我通过一个例子证明了这一点:

假设我们有一组有限的模式P = {他,她,他的,她的}.

在此输入图像描述

阶段2

接下来,我们将关键字树扩展为自动机,以支持线性时间匹配,如下所示.

显然,自动机由两部分组成:

  • 状态
  • 过渡

状态:正是关键字树实现的节点; 和初始状态=树的根.

在此输入图像描述

转换:使用goto(q,a)函数如下.

/*This function gives the state entered from current state q 
     by matching target char a*/
- If edge (q,v) in the tree is labeled by a, then g(q,a)=v; 
- g(0,a)=0 for each a that does not label an edge out of the root
//the automton stays at the initial state while scanning non-matching characters       
- O.W. g(q,a)={}

在此输入图像描述

使用自动机:

SearchofTarget T[1..m] //considering the target string T as an 
                     array of chars indexed from 1   to m
q:= 0; // initial state(root)
for i:= 1 to m do
    while g(q,T[i]) == {}  do 
        q:= fail(q); // follow a fail
    q:= goto(q,T[i]); // follow a goto
    if output(q) != {} then print i, out(q);
endfor;
Run Code Online (Sandbox Code Playgroud)

正如您在上面的伪代码中看到的,它调用了两个函数:fail(q)output(q)

fail(q):// for q!= 0表示输入不匹配的状态

failure(q) is the node labeled by the longest proper suffix  w of L(q) ...
    s.t. w is a prefix of some pattern
//L(q) is the label of node q as the concatenation of edge labels ...
   on the path from the root to q
Run Code Online (Sandbox Code Playgroud)

输出(q):

gives the set of patterns recognized when entering state q
Run Code Online (Sandbox Code Playgroud)

计算完这两个函数后,自动机看起来像这样:

在此输入图像描述

现在您可能想知道何时计算这些方法,所以请查看以下更多官方表格:

在此输入图像描述

希望这有帮助,如果仍然有些含糊不清,请不要犹豫.