Python, huge iteration performance problem

9 python iteration bioinformatics

I am iterating over 3 words, each about 5 million characters long, and I want to find the 20-character sequences that identify each word. That is, I want to find all sequences of length 20 in one word that are unique to that word. My problem is that the code I have written takes an extremely long time to run; I did not even get through a single word after letting my program run overnight.

The function below takes a list of dictionaries, where each dictionary contains the possible 20-character words, together with their locations, from one of the 5-million-character words.
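Each dictionary maps a position to the 20-character slice that starts there, so (simplified, with illustrative names) one of these dictionaries is built roughly like this:

word = "ACGTGTCA" * 625000                 # stands in for one ~5-million-character word
length = 20
word_dict = {}
for start in xrange(len(word) - length + 1):
    word_dict[start] = word[start:start+length]   # position -> 20-character slice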

If anyone knows how to optimize this I would be very grateful; I have no idea how to proceed...

Here is a sample of my code:

def findUnique(list):
    # Takes a list of dictionaries and compares each element in the dictionaries
    # with the others, puts all unique elements in new dictionaries, and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList=[]
    listlength=len(list)
    s=0
    valuelist=[]
    for i in list:
        j=i.values()
        valuelist.append(j)
    while s<listlength:
        currdic=list[s]
        dic={}
        for key in currdic:
            currval=currdic[key]
            test=True
            n=0
            while n<listlength:
                if n!=s:
                    if currval in valuelist[n]: # this is where it takes too much time
                        n=listlength
                        test=False
                    else:
                        n+=1
                else:
                    n+=1
            if test:
                dic[key]=currval
        dicList.append(dic)
        s+=1
    return dicList

小智 10

def slices(seq, length, prefer_last=False):
  unique = {}
  if prefer_last: # this doesn't have to be a parameter, just choose one
    for start in xrange(len(seq) - length + 1):
      unique[seq[start:start+length]] = start
  else: # prefer first
    for start in xrange(len(seq) - length, -1, -1):
      unique[seq[start:start+length]] = start
  return unique

# or find all locations for each slice:
import collections
def slices(seq, length):
  unique = collections.defaultdict(list)
  for start in xrange(len(seq) - length + 1):
    unique[seq[start:start+length]].append(start)
  return unique

This function (currently in my iter_util module) is O(n) (n being the length of each word), and you would use set(slices(..)) (with set operations such as difference) to get the slices that are unique across all the words (example below). You could also write the function to return a set if you don't want to track locations. Memory use will be high (though still O(n), just with a large constant factor), possibly mitigated (though not by much for a length of 20) with a special "lazy slice" class that stores the base sequence (the string) plus a start and stop (or a start and length).
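For instance, the set-returning variant and a bare-bones sketch of that lazy-slice idea (illustrative only; slice_set and LazySlice are made-up names, not part of iter_util) could look like:

def slice_set(seq, length):
  # same single O(n) pass, but keeps only the distinct slices, no locations
  return set(seq[start:start+length]
             for start in xrange(len(seq) - length + 1))

class LazySlice(object):
  # stores the base string plus offsets instead of a copy of the slice;
  # the substring is only materialized temporarily for hashing/equality
  __slots__ = ("seq", "start", "stop")
  def __init__(self, seq, start, stop):
    self.seq, self.start, self.stop = seq, start, stop
  def _text(self):
    return self.seq[self.start:self.stop]
  def __hash__(self):
    return hash(self._text())
  def __eq__(self, other):
    if isinstance(other, LazySlice):
      other = other._text()
    return self._text() == other

As the caveat above says, for slices as short as 20 characters the per-object overhead eats most of the savings, so this mainly pays off for longer slice lengths.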

Printing the unique slices:

a = set(slices("aab", 2)) # {"aa", "ab"}
b = set(slices("abb", 2)) # {"ab", "bb"}
c = set(slices("abc", 2)) # {"ab", "bc"}
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
print a_unique # {"aa"}

Including the locations:

a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print a_unique # {"aa": 0} or {"aa": [0]}
               # (depending on which slices function used)

In a test script much closer to your conditions, using randomly generated 5-million-character words and a slice length of 20, memory use was so high that my test script quickly hit my 1 GB main-memory limit and started thrashing virtual memory. At that point Python spent very little time on the CPU and I killed it. Reducing either the slice length or the word length (since I used completely random words, which reduces duplication and increases memory use) enough to fit in main memory, it ran in under a minute. That situation, plus the O(n**2) behaviour of your original code (the "in" test against a plain list of another word's values is a linear scan, repeated for every slice), would take forever, and it is why both the time and the space complexity of the algorithm matter.

import operator
import random
import string

def slices(seq, length):
  unique = {}
  for start in xrange(len(seq) - length, -1, -1):
    unique[seq[start:start+length]] = start
  return unique

def sample_with_repeat(population, length, choice=random.choice):
  return "".join(choice(population) for _ in xrange(length))

word_length = 5*1000*1000
words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print [len(x) for x in unique_words_slices]
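As a rough sanity check on those memory figures (assuming a 64-bit Python 2 build; exact numbers vary by platform):

import sys
word_length = 5*1000*1000
slice_length = 20
slice_bytes = sys.getsizeof("x" * slice_length)           # one 20-character str object
per_word = slice_bytes * (word_length - slice_length + 1)
print "str objects alone, per word: ~%d MB" % (per_word // 2**20)
# the set/dict hash table adds more on top of that, and there are three
# words, so fully random input comfortably exceeds 1 GB in total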