Frequent words in Python

sha*_*ida 4 python bioinformatics string-matching

How can I write code to find the most frequent 2-mer in "GATCCAGATCCCCATAC"? I wrote the code below, but I seem to have it wrong. Please help me correct it.

def PatternCount(Pattern, Text):
    count = 0
    for i in range(len(Text)-len(Pattern)+1):
        if Text[i:i+len(Pattern)] == Pattern:
            count = count+1
    return count

This code is meant to print the most frequent k-mer in a string, but it does not give me the 2-mers of the given string.

MMF*_*MMF 5

You can first define a function that returns all k-mers in a string:

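def get_all_k_mer(string, k=1):
    # Slide a window of width k over the string.
    # range() is used instead of xrange() so this also runs on Python 3.
    length = len(string)
    return [string[i:i + k] for i in range(length - k + 1)]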

Then you can use collections.Counter to count how many times each k-mer occurs:

>>> from collections import Counter
>>> s = 'GATCCAGATCCCCATAC'
>>> Counter(get_all_k_mer(s, k=2))

Output:

Counter({'AC': 1,
         'AG': 1,
         'AT': 3,
         'CA': 2,
         'CC': 4,
         'GA': 2,
         'TA': 1,
         'TC': 2})
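The Counter holds the count of every 2-mer, but the question asks for the most frequent one, so it still has to be picked out. Here is a minimal sketch, assuming all tied k-mers should be reported. `most_frequent_k_mers` is a hypothetical helper name, not part of the original answer:

```python
from collections import Counter


def get_all_k_mer(string, k=1):
    # All overlapping substrings of length k
    return [string[i:i + k] for i in range(len(string) - k + 1)]


def most_frequent_k_mers(string, k):
    # Hypothetical helper: returns (sorted list of tied k-mers, their count)
    counts = Counter(get_all_k_mer(string, k))
    if not counts:
        # k is longer than the string, so there are no k-mers at all
        return [], 0
    best = max(counts.values())
    winners = sorted(kmer for kmer, n in counts.items() if n == best)
    return winners, best


print(most_frequent_k_mers('GATCCAGATCCCCATAC', 2))  # (['CC'], 4)
```

`Counter.most_common(1)` would also return the top entry, but it reports only one k-mer when several are tied, which is why the sketch compares against the maximum count instead.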

Another example:

>>> s = "AAAAAA"
>>> Counter(get_all_k_mer(s, k=3))

Output:

Counter({'AAA': 4})
# Indeed : AAAAAA
           ^^^     -> 1st time
            ^^^    -> 2nd time
             ^^^   -> 3rd time
              ^^^  -> 4th time
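For comparison, the PatternCount function from the question can also be reused: it counts a single pattern, so it has to be called once for each distinct k-mer in the text. This is a rough sketch under that assumption. `frequent_words` is a hypothetical name, and the approach is slower than the Counter version because it rescans the text for every candidate:

```python
def PatternCount(Pattern, Text):
    # Count overlapping occurrences of Pattern in Text (as in the question)
    count = 0
    for i in range(len(Text) - len(Pattern) + 1):
        if Text[i:i + len(Pattern)] == Pattern:
            count = count + 1
    return count


def frequent_words(Text, k):
    # Hypothetical helper: every distinct k-mer is a candidate pattern
    candidates = {Text[i:i + k] for i in range(len(Text) - k + 1)}
    if not candidates:
        return []
    counts = {kmer: PatternCount(kmer, Text) for kmer in candidates}
    best = max(counts.values())
    # Keep every k-mer that reaches the maximum count
    return sorted(kmer for kmer, n in counts.items() if n == best)


print(frequent_words("GATCCAGATCCCCATAC", 2))  # ['CC']
```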