python iteration bioinformatics
I'm iterating over 3 words, each roughly 5 million characters long, and I want to find the 20-character sequences that identify each word. That is, for each word I want to find all sequences of length 20 that are unique to that word. My problem is that the code I've written takes far too long to run; it didn't even finish a single word when I left the program running overnight.
The function below takes a list of dictionaries, where each dictionary contains the candidate 20-character sequences and their locations from one of the 5-million-character words.
If anyone knows how to optimize this I'd be very grateful; I don't know how to proceed...
Here's a sample of my code:
def findUnique(list):
    # Takes a list with dictionaries and compares each element in the dictionaries
    # with the others, puts all unique elements in new dictionaries and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList=[]
    listlength=len(list)
    s=0
    valuelist=[]
    for i in list:
        j=i.values()
        valuelist.append(j)
    while s<listlength:
        currdic=list[s]
        dic={}
        for key in currdic:
            currval=currdic[key]
            test=True
            n=0
            while n<listlength:
                if n!=s:
                    if currval in valuelist[n]: # this is where it takes too much time
                        n=listlength
                        test=False
                    else:
                        n+=1
                else:
                    n+=1
            if test:
                dic[key]=currval
        dicList.append(dic)
        s+=1
    return dicList
小智 10
def slices(seq, length, prefer_last=False):
    unique = {}
    if prefer_last: # this doesn't have to be a parameter, just choose one
        for start in xrange(len(seq) - length + 1):
            unique[seq[start:start+length]] = start
    else: # prefer first
        for start in xrange(len(seq) - length, -1, -1):
            unique[seq[start:start+length]] = start
    return unique

# or find all locations for each slice:
import collections
def slices(seq, length):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[seq[start:start+length]].append(start)
    return unique
This function (currently in my iter_util module) is O(n) (n being the length of each word). You can use set(slices(..)) with set operations such as difference to get the slices that are unique across all words (examples below); you could also write the function to return a set if you don't want to track locations. Memory usage will be high (though still O(n), just with a large constant factor). It could possibly be mitigated with a special "lazy slices" class that stores the base sequence (the string) plus a start and stop (or start and length), though at a slice length of only 20 that doesn't buy much.
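As a rough illustration of that "lazy slices" idea, here is a minimal, hypothetical sketch (the LazySlice name and its materialize method are my own invention, not something the code below depends on). It stores only a reference to the base string plus two integers, and builds the real substring transiently when hashed or compared:
class LazySlice(object):
    # Hashable view of seq[start:start+length]: keeps a reference to the
    # base sequence plus two integers instead of a separate substring.
    __slots__ = ("seq", "start", "length")
    def __init__(self, seq, start, length):
        self.seq = seq
        self.start = start
        self.length = length
    def materialize(self):
        # Build the actual substring only when it is needed.
        return self.seq[self.start:self.start + self.length]
    def __hash__(self):
        # The substring is created transiently here and then discarded.
        return hash(self.materialize())
    def __eq__(self, other):
        return self.materialize() == other.materialize()
Whether this saves anything depends on the slice length; as noted, at length 20 the per-object overhead is comparable to just storing the 20-character strings.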
Printing the unique slices:
a = set(slices("aab", 2)) # {"aa", "ab"}
b = set(slices("abb", 2)) # {"ab", "bb"}
c = set(slices("abc", 2)) # {"ab", "bc"}
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
print a_unique # {"aa"}
Including locations:
a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print a_unique # {"aa": 0} or {"aa": [0]}
# (depending on which slices function used)
In a test script closer to your conditions, using randomly generated 5-million-character words and a slice length of 20, memory usage was so high that the script quickly hit my 1 GB main-memory limit and started thrashing into virtual memory. At that point Python spent hardly any time on the CPU and I killed it. Reducing either the slice length or the word length (since I used completely random words, which reduces repetition and so increases memory use) enough to fit in main memory let it run in under a minute. That, combined with the O(n**2) behaviour of your original code, is why it takes forever, and it's why both the time and space complexity of the algorithm matter.
import operator
import random
import string

def slices(seq, length):
    unique = {}
    for start in xrange(len(seq) - length, -1, -1):
        unique[seq[start:start+length]] = start
    return unique

def sample_with_repeat(population, length, choice=random.choice):
    return "".join(choice(population) for _ in xrange(length))

word_length = 5*1000*1000
words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print [len(x) for x in unique_words_slices]