使用Python计算组中任意两个字符串之间的最大距离

Jen*_*Jen 6 python difference

我的问题是如何计算任何两个对应于某个组的字符串之间的最大距离.我文件中的每一行都以"组号"开头,后跟一个长字符串.我想知道,对于每个组,每组中任何两个字符串之间的最大距离是多少.下面是我正在使用的文件类型(字符串已缩短).请注意,组不一定是有序的,我的一些组只有一个与之关联的字符串,所以我想跳过它们(下面的例子中的组'3'):

 0 GCAGACGGGUGAGUAACGCGUGGGAACGUACCAUUUGCUACGGAAUAACUCAGG
 0 GCAGACGGGUGAGUAACGCGUGGGAACGUACCAUUUGCUACGGAAUAACUCAGG
 1 CGAACGGGUGAGUAACACGUGGGCAAUCUGCCCUGCACUCUGGGACAAGCCCUG
 1 CGAACGGGUGAGUAACACGUGGGCAAUCUGCCCUGCACUCUGGGACAAGCCCUG
 1 CGAACGGGUGAGUAACACGUGGGCAAUCUGCCCUGCACUCUGGGACAAGCCCUG
 2 GCCCUUCGGGGUACUCGAGUGGCGAACGGGUGAGUAACACGUGGGUGAUCUGCC
 2 GCCCUUCGGGGUACUCGAGUGGCGAACGGGUGAGUAACACGUGGGUGAUCUGCC
 2 GCCCUUCGGGGUACUCGAGUGGCGAACGGGUGAGUAACACGUGGGUGAUCUGCC
 0 GCAGACGGGUGAGUAACGCGUGGGAACGUACCAUUUGCUACGGAAUAACUCAGG
 0 GCAGACGGGUGAGUAACGCGUGGGAACGUACCAUUUGCUACGGAAUAACUCAGG
 3 GCAGACGGGUGAGUAACAAAAAGGAACGUACCAUUUGCUACGGAAUAACUCAGG
Run Code Online (Sandbox Code Playgroud)

我想创建一些能够创建如下输出的东西:

 Group0 = 0
 Group1 = 1.2
 Group2 = 2.1

 Average = 1.1
Run Code Online (Sandbox Code Playgroud)

此输出将为我提供组编号,然后是该组的最大差异.还有所有组之间最大差异的总体平均值(再次跳过只有一个与之关联的字符串的组):

我的真实文件有大约5000个组,我正在比较的字符串长约400个字符.

我想我可以通过查看这个问题开始解决这个问题,但我不确定如何只计算同一组中字符串的百分比差异,避免只有一个字符串的组,并计算所有组的总体平均百分比差异.非常感谢任何帮助,非常感谢您的任何想法!

编辑:这是我正在使用的文件中的一些截断行.'group'数字的范围从0到~6000.字母串实际上是426个字符长.文件格式为[数字] [空格] [字母串] [行尾字符]

7 UGGCGAACGGGUGAGUAAC
35 GUGGGGAUUAGUGGCGAAC
50 AAACGAGAUGUAGCAAUAC
82 GGAGAGAGCUUGCUCUCUU
479 UCAGGAGCUUGCUCCUGU
46 CGAGGAGCUUGCUCCUUU
24 AACUGGGUCUAAUACCUU

πόδ*_*κύς 5

您也可以尝试从标准库中使用difflib的SequenceMatcher:

>>> import difflib
>>> from itertools import groupby, combinations

>>> def find_max_ratio(lines):
    lines = [row.split() for row in lines]  # the file should already break at each line break
    lines = [(int(row[0]), row[1]) for row in lines]
    lines = groupby(sorted(lines), lambda x: x[0])  # combine strings into their respective groups, sorting them first on int of first element
    group_max = dict()
    for group in lines:
        strings = list(group[1])  # need to convert group[1] from iterator into list
        if len(strings) > 1:  # if the number of strings is 1, then there is nothing to compare the string with in its group
            similarity = 1
            for line1, line2 in combinations(strings, 2):
                s = difflib.SequenceMatcher(None, line1[1], line2[1])  # need to compare second element in each list and exclude the first element (which is the group number)
                similarity = s.ratio() if s.ratio() < similarity else similarity
            group_max[line1[0]] = 1 - similarity  # gives difference ratio
    return group_max

>>> t = open('test.txt')
>>> print find_max_ratio(t)  # it appears that your examples don't have any differences
{'1': 0, '0': 0, '2': 0}
Run Code Online (Sandbox Code Playgroud)

然后,您可以按如下方式计算平均值:

>>> max_ratios = find_max_ratio(t)
>>> average = sum(max_ratios.values())/float(len(max_ratios))
>>> average
0.0  # there are no differences in your test data above
Run Code Online (Sandbox Code Playgroud)

编辑:写入文件

>>> output = sorted(max_ratios.items(), key=lambda x: x[1], reverse=True)  # sorting by descending ratios
>>> with open('test2.txt', 'w') as f:  # a new file name
>>>     f.write('\n'.join([group + ': ' + str(ratio) for group, ratio in output])
                + '\n\nAverage: ' + str(average))
Run Code Online (Sandbox Code Playgroud)

编辑2:添加最小差异

您可以在结果中添加最小差异(这里以这样的元组形式(<max_difference>, <min_difference>):

def find_maxmin_ratios(lines):
    lines = [row.split() for row in lines]  # the file should already break at each line break
    lines = [(int(row[0]), row[1]) for row in lines]
    lines = groupby(sorted(lines), lambda x: x[0])  # combine strings into their respective groups, sorting them first on int of first element
    group_minmax = dict()
    for index, group in lines:
        strings = list(group)  # need to convert group[1] from iterator into list
        if len(strings) > 1:  # if the number of strings is 1, then there is nothing to compare the string with in its group
            max_similarity = 1
            min_similarity = 0
            for line1, line2 in combinations(strings, 2):
                s = difflib.SequenceMatcher(None, line1[1], line2[1])  # need to compare second element in each list and exclude the first element (which is the group number)
                max_similarity = s.ratio() if s.ratio() < max_similarity else max_similarity
                min_similarity = s.ratio() if s.ratio() > min_similarity else min_similarity
            group_minmax[index] = (1 - max_similarity, 1 - min_similarity)  # gives max difference ratio and then min difference ratio
    return group_minmax
Run Code Online (Sandbox Code Playgroud)

然后你可以找到这样的平均值:

>>> t = open('test.txt')
>>> maxmin_ratios = find_maxmin_ratios(t)
>>> maxmin_ratios
{'1': (0, 0.0), '0': (0, 0.0), '2': (0, 0.0)}  # again, no differences in your test data
>>> average_max = sum([maxmin[0] for maxmin in maxmin_ratios.values()])/float(len(maxmin_ratios))
>>> average_min = sum([maxmin[1] for maxmin in maxmin_ratios.values()])/float(len(maxmin_ratios))
>>> average_max, average_min
(0.0, 0.0)  # no differences in your test data
Run Code Online (Sandbox Code Playgroud)

编辑3:优化问题

最后,根据您的上一条评论,我不确定您是否能够以目前的形式过度优化此功能.如果您的计算机无法处理它,您可能需要处理较小的文本块,然后在最后编译结果.difflib不需要大量的内存,但它做了很多工作.你的表现应该比我的好多了(取决于你的机器),因为我的每一行都是随机的.如果你的线条比不相似的线条更相似,你应该做得更好.以下是我的机器上cProfile的结果,用于以下场景(总共3.172小时):

text2.txt
- 9700 lines of text
- each line begins with one random number (1 to 10)
- each line has 400 random characters that follow the random number  # if your data is not random, you should do CONSIDERABLY better than this
Run Code Online (Sandbox Code Playgroud)

请注意,cumtime的大部分时间(给定函数的总时间及其下面的所有函数)都花费在difflib中,而difflib在当前函数的控制范围之外.事实上,该功能的其余部分只需要很少的时间.

4581938093 function calls in 11422.852 seconds

   Ordered by: tottime  # the total time spent in a given function, excluding time spent in subfunctions

ncalls  tottime percall cumtime percall filename:lineno(function)
81770876    8579.568    0   9919.636    0   difflib.py:350(find_longest_match)
-724102230  1268.238    0   1268.238    0   {method 'get' of 'dict' objects}
4700900 874.878 0   1143.419    0   difflib.py:306(__chain_b)
9401960 160.366 0   10183.511   0.001   difflib.py:460(get_matching_blocks)
2060343126  141.242 0   141.242 0   {method 'append' of 'list' objects}
1889761800  110.013 0   110.013 0   {method 'setdefault' of 'dict' objects}
81770876    32.433  0   55.41   0   <string>:8(__new__)
130877001   32.061  0   32.061  0   {built-in method  __new__ of type object at 0x1E228030}
81770876    29.773  0   29.773  0   {method 'pop' of 'list' objects}
1   23.259  23.259  11422.852   11422.852   <pyshell#50>:1(find_maxmin_ratios)
49106125    21.45   0   33.218  0   <string>:12(_make)
9401960 20.539  0   10239.234   0.001   difflib.py:636(ratio)
335752019   17.719  0   17.719  0   {len}
9401960 17.607  0   30.829  0   {_functools.reduce}
4700900 16.778  0   49.996  0   {map}
230344786   16.42   0   16.42   0   {method  __contains__' of 'set' objects}
191093877   14.962  0   14.962  0   {method 'add' of 'set' objects}
98214517    13.222  0   13.222  0   difflib.py:658(<lambda>)
4700900 6.428   0   6.428   0   {method 'sort' of 'list' objects}
4700900 5.794   0   5.794   0   {method 'items' of 'dict' objects}
4700900 5.339   0   1148.758    0   difflib.py:261(set_seq2)
4700900 4.333   0   1160.351    0   difflib.py:154(__init__)
4700900 3.83    0   1156.018    0   difflib.py:223(set_seqs)
4700900 3.43    0   3.43    0   difflib.py:235(set_seq1)
9401960 3.162   0   3.162   0   difflib.py:41(_calculate_ratio)
9700    0.003   0   0.003   0   {method 'strip' of 'str' objects}
1   0.003   0.003   0.003   0.003   {sorted}
9700    0.001   0   0.001   0   <pyshell#50>:3(<lambda>)
1   0   0   11422.852   11422.852   <string>:1(<module>)
1   0   0   0   0   {method 'disable' of '_lsprof.Profiler' objects}
Run Code Online (Sandbox Code Playgroud)

如果你的机器可以处理它,我会运行这个功能,并准备等待两三个小时.这里发生了很多事情,以便逐个字符地比较这些字符串.