Posts by bio*_*ojl

Copy the upper triangle to the lower triangle in a Python matrix

                        ailuropoda_melanoleuca  bos_taurus  callithrix_jacchus  canis_familiaris
ailuropoda_melanoleuca                       0        84.6                97.4                44
bos_taurus                                   0           0                97.4              84.6
callithrix_jacchus                           0           0                   0              97.4
canis_familiaris                             0           0                   0                 0


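This is a short version of the Python matrix I have. The information is in the upper triangle. Is there a simple function that copies the upper triangle into the lower triangle of the matrix?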
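One way to do the copy can be sketched in plain Python, assuming the matrix is a square list of lists with zeros below the diagonal. The function name `mirror_upper_triangle` and the variable `distances` are illustrative and do not come from the post.

```python
def mirror_upper_triangle(matrix):
    """Return a new square matrix whose lower triangle mirrors the upper one."""
    n = len(matrix)
    result = [list(row) for row in matrix]  # copy so the input is left untouched
    for i in range(n):
        for j in range(i + 1, n):
            # (i, j) lies above the diagonal; write its value to (j, i) below it.
            result[j][i] = result[i][j]
    return result


# Values taken from the example matrix in the question.
distances = [
    [0, 84.6, 97.4, 44],
    [0, 0, 97.4, 84.6],
    [0, 0, 0, 97.4],
    [0, 0, 0, 0],
]
symmetric = mirror_upper_triangle(distances)
```

If the data is held in a NumPy array `m` instead, an expression such as `m + m.T - np.diag(np.diag(m))` should give the same result, provided the lower triangle really contains only zeros.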

python matrix

bio*_*ojl

2013 05-09

24
Score

3
Answers

7896
Views

Getting the position of a subsequence using Levenshtein distance

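I have a large number of records containing sequences ('ATCGTGTGCATCAGTTTCGA ...'), up to 500 characters long. I also have a list of smaller sequences, usually 10-20 characters. I would like to use the Levenshtein distance to find these smaller sequences inside the records, allowing for small changes or indels (L_distance <= 2).

The problem is that I also want the start position of each smaller sequence, and the distance function apparently only compares the two strings as wholes, so a short pattern scored against a long record mostly measures the difference in length.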

>>> import Levenshtein
>>> s1 = raw_input('first word: ')
first word: ATCGTAATACGATCGTACGACATCGCGGCCCTAGC
>>> s2 = raw_input('second word: ')
second word: TACGAT
>>> Levenshtein.distance(s1,s2)
29


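In this example, I would like to get the position (7) and the distance (0 in this case).

Is there an easy way to solve this, or do I have to break the larger sequences into smaller ones and then run the Levenshtein distance on all of them? That might take too much time.

Thanks.

Update:

# Naive implementation: generate every substring after looking for an exact match.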

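def find_tag(pattern, text, errors):
    """Return [start, distance] of the best window within `errors`, or None."""
    m = len(pattern)
    match = None                 # stays None when no window is close enough
    min_distance = errors + 1
    # Slide a window of len(pattern) over the text and score each one.
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        distance = Levenshtein.distance(window, pattern)
        print("%s %d" % (window, distance))  # to see all matches
        if distance < min_distance:
            match = [i, distance]
            min_distance = distance
    return match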

#Real example. In this case just looking for one pattern, but we have about 50.
import re, Levenshtein

text = …
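The fixed-width window above cannot follow insertions or deletions, because every candidate substring has exactly the length of the pattern. A common alternative is a dynamic-programming variant of edit distance in which a match may start anywhere in the text at no cost (often described as Sellers' algorithm). The sketch below is a minimal pure-Python version that also tracks where each alignment starts; the name `find_approximate` and its return convention are assumptions made for illustration, not part of the `Levenshtein` package.

```python
def find_approximate(pattern, text, max_errors):
    """Return (start, distance) of the best approximate match, or None.

    Positions are 0-based. Among equally good matches the one that ends
    first in the text is kept.
    """
    m = len(pattern)
    # prev_cost[i]: edit distance between pattern[:i] and the best substring
    # of text ending at the previous text position.
    # prev_start[i]: index in text where that substring starts.
    prev_cost = list(range(m + 1))
    prev_start = [0] * (m + 1)
    best = None
    for j in range(1, len(text) + 1):
        cur_cost = [0] * (m + 1)   # entry 0 stays 0: a match may start anywhere
        cur_start = [j] * (m + 1)  # an empty match ending at j also starts at j
        for i in range(1, m + 1):
            mismatch = 1 if pattern[i - 1] != text[j - 1] else 0
            candidates = [
                (prev_cost[i - 1] + mismatch, prev_start[i - 1]),  # match or substitution
                (prev_cost[i] + 1, prev_start[i]),                 # extra character in text
                (cur_cost[i - 1] + 1, cur_start[i - 1]),           # pattern character missing
            ]
            # Lowest cost wins; ties go to the leftmost start.
            cur_cost[i], cur_start[i] = min(candidates)
        if cur_cost[m] <= max_errors and (best is None or cur_cost[m] < best[1]):
            best = (cur_start[m], cur_cost[m])
        prev_cost, prev_start = cur_cost, cur_start
    return best


# The sequences from the question.
record = 'ATCGTAATACGATCGTACGACATCGCGGCCCTAGC'
hit = find_approximate('TACGAT', record, 2)
```

The running time is proportional to `len(text) * len(pattern)`, so for 500-character records and 10-20 character patterns each search stays small, and no substrings need to be generated.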


python algorithm dna-sequence levenshtein-distance

bio*_*ojl

2013 11-01

5
Score

1
Answers

805
Views

How to make rows unique by column A in R, keeping the row with the maximum value in column B

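I have a data.frame with multiple columns (17). Column 2 has several rows with the same value, and I want to keep only one of them, specifically the row with the maximum value in column 17.

For example: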

A    B
'a'  1
'a'  2
'a'  3
'b'  5
'b'  200

Would return
A    B
'a'  3
'b'  200


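(plus the other columns)

So far I have been using the unique function, but I think it keeps either a random row or the first occurrence.

**Update** The real data has 376,000 rows. I have tried the data.table and ddply suggestions, but they take forever. Any ideas for the most efficient approach?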
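The question is about R, but the underlying logic is a single pass that remembers the best row seen so far for each key, which is linear in the number of rows. The Python sketch below only illustrates that logic; the function name `keep_max_per_key` and the tuple-based rows are assumptions. In R itself, a similar effect is often obtained by sorting on the value column in decreasing order and then dropping the rows that `duplicated()` flags for the key column.

```python
def keep_max_per_key(rows, key_index, value_index):
    """Return one row per key: the row holding the largest value for that key.

    Keys come out in the order in which they were first seen.
    """
    best = {}
    for row in rows:
        key = row[key_index]
        # Keep the row if the key is new or if it beats the current best value.
        if key not in best or row[value_index] > best[key][value_index]:
            best[key] = row
    return list(best.values())


# The example table from the question, as (A, B) tuples.
rows = [('a', 1), ('a', 2), ('a', 3), ('b', 5), ('b', 200)]
unique_rows = keep_max_per_key(rows, key_index=0, value_index=1)
```

When several rows share the same maximum, this version keeps the first one encountered.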

r unique

bio*_*ojl

2013 01-15

4
Score

1
Answers

1461
Views

High-performance computation of the least-squares difference over all possible combinations (n lists)

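I am looking for a very efficient way to compute all possible combinations across n lists and then keep the combination with the smallest least-squares difference.

I already have code that does this, but once it reaches several million combinations things get slow.

candidates_len contains a list of lists of lengths, e.g. [[500,490,510,600], [300,490,520], [305,497,515]]; candidates_name contains a list of lists of names, e.g. [['a','b','c','d'], ['mi','mu','ma'], ['pi','pu','pa']].

Both lists contain n lists.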

#    Creating the possible combinations and store the lists of lengths in vector r
r=[[]]
for x in candidates_len:
    r = [ i + [y] for y in x for i in r ]
#Storing the names of the combinations and store the lists of identifiers in vector z
z=[[]]
for x in candidates_name:
    z = [ i + [y] for y in x for i …
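Building the full lists `r` and `z` keeps every combination in memory at once. A lighter variant, sketched below, walks the combinations lazily with `itertools.product` and remembers only the best one so far. The cost function is an assumption, since the post's own formula is cut off: here it is the sum of squared deviations from the combination's mean. The names `spread` and `best_combination` are illustrative.

```python
from itertools import product


def spread(lengths):
    """Sum of squared deviations from the mean (assumed cost, not from the post)."""
    mean = sum(lengths) / float(len(lengths))
    return sum((x - mean) ** 2 for x in lengths)


def best_combination(candidates_len, candidates_name):
    """Return (names, lengths, cost) of the combination with the smallest cost."""
    # Pair each length with its name so both travel together through product().
    paired = [list(zip(lens, names))
              for lens, names in zip(candidates_len, candidates_name)]
    best_names, best_lengths, best_cost = None, None, None
    for combo in product(*paired):  # one (length, name) pair taken from each list
        lengths = [length for length, _ in combo]
        cost = spread(lengths)
        if best_cost is None or cost < best_cost:
            best_names = [name for _, name in combo]
            best_lengths = lengths
            best_cost = cost
    return best_names, best_lengths, best_cost


# The example data from the question.
candidates_len = [[500, 490, 510, 600], [300, 490, 520], [305, 497, 515]]
candidates_name = [['a', 'b', 'c', 'd'], ['mi', 'mu', 'ma'], ['pi', 'pu', 'pa']]
names, lengths, cost = best_combination(candidates_len, candidates_name)
```

This is still an exhaustive search, so the number of evaluations is unchanged; what it saves is the memory and list-building time of materializing millions of combinations up front.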


python performance combinations dictionary least-squares

bio*_*ojl

2012 12-06

2
Score

1
Answers

487
Views