任意字符串与面向字符的多序列比较

use*_*314 7 python string algorithm character levenshtein-distance

核心问题是:

我正在寻找一种算法来计算一组字符串之间的最大简约距离.对于距离,我的意思是类似于Damerau-Levenshtein距离,即字符或相邻字符块的删除,插入,替换和转置的最小数量.但是我不想使用常规字符串来调查带有方向字符的字符串.

因此,字符串可能如下所示:

  1. (A,1) (B,1) (C,1) (D,1)

可能的衍生物可能是:

  1. (A,1) (C,0) (B,0) (D,1)
  2. (A,1) (C,1) (B,1) (D,1)
  3. (A,1) (B,0) (C,0) (D,1)

A,B,C,D字符身份在哪里1 = forward0 = reverse.

这里,导数1.将具有距离2,因为你可以切出块BC并重新粘贴它(1切,1贴).衍生物2.也会有2个,因为你可以剪掉C并重新粘贴在B前面(1个剪切,1个粘贴),而数字3则需要4个操作(2个剪切,2个粘贴)进行转换.类似,删除或插入块将产生距离1.

如果要为所有可能的X 定义(X,0)(X,1)作为两个不同的非面向字符(X0, X1),示例3.将导致距离为2,因为您可以切出块B1C1B0C0分两步插入块.

一个现实世界的例子:

细菌基因组中的基因可以被认为是定向特征(A,0),(B,0)......通过确定序列距离,两个相关细菌中同源基因的基因组方向可以用作进化标记迹线.细菌基因组是圆形字符串的事实引入了额外的边界条件ABC等于BCA.

真正的基因组确实具有独特的基因,在伴侣中没有相应的基因,从而产生了一个占位符字符@.这些占位符将比较的信息内容减少到下限,因为例如(A,1)(B,1)@(C,1)可以转换为(A,1)@@@(B,1) @(C,1)通过插入块@@@.然而,方向部分恢复信息内容,因为您可能会发现(A,1)@@@(B,0)@(C,1)表示最小距离为3.更好的是比较多个相关序列的算法(同时,因为你可以在进化历史中找到中间体,从而提高分辨率.

我意识到,在文本字符串比较中已经发布了几个问题.但它们无法轻易扩展以包括方向.此外,存在大量处理生物序列的方法,特别是用于多序列分析.然而,它们仅限于在交替方向上不存在的大分子序列,并且通常为任何特定的字符匹配调用特定的权重.

如果已经有一个python库可以进行必要的自定义来解决这个问题,那就太棒了.但任何合适的方向感知算法都会非常有用.

Gam*_*iac 1

我相信下面的代码可以帮助你:

from itertools import permutations
from random import randint
from pprint import pprint


def generate_genes():
    """
    Generates a boilerplate list of genes
    @rtype : list
    """
    tuple_list = []

    for i in range(16):

        binary_var = bin(i)[2:]

        if len(binary_var) != 4:
            binary_var = "0" * (4 - len(binary_var)) + binary_var

        tuple_list.append([('A', (1 if binary_var[0] == '1' else 0)),
                           ('B', (1 if binary_var[1] == '1' else 0)),
                           ('C', (1 if binary_var[2] == '1' else 0)),
                           ('D', (1 if binary_var[3] == '1' else 0))])

    return tuple_list


def all_possible_genes():
    """ Generates all possible combinations of ABCD genes
    @return: returns a list of combinations
    @rtype: tuple
    """
    gene_list = generate_genes()
    all_possible_permutations = []
    for gene in gene_list:
        all_possible_permutations.append([var for var in permutations(gene)])

    return all_possible_permutations


def gene_stringify(gene_tuple):
    """
    @type gene_tuple : tuple
    @param gene_tuple: The gene tuple generated
    """

    return "".join(str(var[0]) for var in gene_tuple if var[1])


def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance between sequences.

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]


if __name__ == '__main__':
    genes = all_possible_genes()
    list1 = genes[randint(0, 15)][randint(0, 23)]
    list2 = genes[randint(0, 15)][randint(0, 23)]

    print gene_stringify(list1)
    pprint(list1)
    print gene_stringify(list2)
    pprint(list2)
    print dameraulevenshtein(gene_stringify(list1), gene_stringify(list2))
Run Code Online (Sandbox Code Playgroud)

制作人员

Michael Homer 的算法