How do I compare clusters?

Question

Jen*_*Jen 5 python algorithm cluster-analysis

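Hopefully this can be done in Python! I ran two clustering programs on the same data, and now I have a cluster file from each. I reformatted the files so they look like this: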

Cluster 0:
Brucellaceae(10)
    Brucella(10)
        abortus(1)
        canis(1)
        ceti(1)
        inopinata(1)
        melitensis(1)
        microti(1)
        neotomae(1)
        ovis(1)
        pinnipedialis(1)
        suis(1)
Cluster 1:
    Streptomycetaceae(28)
        Streptomyces(28)
            achromogenes(1)
            albaduncus(1)
            anthocyanicus(1)

etc.

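These files contain bacterial species information. Each entry starts with the cluster number (Cluster 0), followed directly below by its "family" (Brucellaceae) and the number of bacteria in that family (10). Below that are the genera found in that family (the name followed by a count, e.g. Brucella(10)), and finally the species in each genus (abortus(1), etc.).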

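My problem: I have two files formatted this way and want to write a program that finds the differences between them. The only issue is that the two programs number their clusters differently, so two clusters may be identical even though their "cluster numbers" differ (the contents of Cluster 1 in one file might match Cluster 43 in the other, with the cluster number being the only difference). So I need something that ignores the cluster numbers and focuses on the cluster contents.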

Is there any way to compare these two files and check for differences? Is it even possible? Any ideas would be greatly appreciated!

Answer 1

Ste*_*ski 1

Given:

file1 = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(3)
  green(1)
   dot(1)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
file2 = '''Cluster 18:
 giant(2)
  red(2)
   brick(1)
   tomato(1)
Cluster 19:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')

Is this what you need?

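def parse_file(open_file):
    """Return a 'family.genus.species' string for every species line.

    Assumes one space of indentation per level, as in the sample data above.
    """
    result = []
    levels = ['', '', '']

    for line in open_file:
        indent_level = len(line) - len(line.lstrip())
        if indent_level == 0:
            # A "Cluster N:" header: start a fresh family/genus/species path.
            levels = ['', '', '']
            continue
        # Drop the trailing "(count)" to get the bare name.
        item = line.strip().split('(', 1)[0]
        levels[indent_level - 1] = item
        if indent_level == 3:
            # Species level: record the full dotted path.
            result.append('.'.join(levels))
    return result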

data1 = set(parse_file(file1))
data2 = set(parse_file(file2))

differences = [
    ('common elements', data1 & data2),
    ('missing from file2', data1 - data2),
    ('missing from file1', data2 - data1) ]
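
Note that `parse_file` relies on exactly one space of indentation per level, while the files in the question appear to use four spaces, and the family line does not always start in the same column. The following is a minimal sketch of a parser that infers depth from relative indentation instead. It assumes spaces rather than tabs, and the names `parse_clusters`, `HEADER` and `COUNT` are hypothetical, not part of the answer above.

```python
import re

HEADER = re.compile(r"^\s*Cluster\s+(\d+):")  # e.g. "Cluster 0:"
COUNT = re.compile(r"\(\d+\)\s*$")            # trailing "(10)"


def parse_clusters(lines):
    """Map each cluster number to the set of 'family.genus.species' names in it."""
    clusters = {}
    members = None
    stack = []  # (indent_width, name) pairs, outermost level first
    for raw in lines:
        line = raw.rstrip()
        if not line.strip():
            continue  # skip blank lines
        header = HEADER.match(line)
        if header:
            members = clusters.setdefault(int(header.group(1)), set())
            stack = []
            continue
        if members is None:
            continue  # ignore anything before the first header
        indent = len(line) - len(line.lstrip())
        name = COUNT.sub("", line.strip())
        # Discard entries at the same or a deeper indentation than this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack.append((indent, name))
        if len(stack) == 3:  # family, genus, species
            members.add(".".join(n for _, n in stack))
    return clusters


sample = """Cluster 0:
Brucellaceae(10)
    Brucella(10)
        abortus(1)
        canis(1)
Cluster 1:
    Streptomycetaceae(28)
        Streptomyces(28)
            achromogenes(1)
""".splitlines()

clusters = parse_clusters(sample)
```

Taking the union of all the per-cluster sets gives the same kind of flat set that `set(parse_file(...))` produces, so the comparison above should work unchanged on it.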

To see the differences:

for desc, items in differences:
    print(desc)
    print()
    for item in items:
        print('\t' + item)
    print()

This prints:

common elements

    giant.red.brick
    tiny.blue.candy
    tiny.blue.flower

missing from file2

    tiny.green.dot
    giant.red.apple

missing from file1

    giant.red.tomato
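
The comparison above pools every species into one set per file, so it cannot tell whether the same species were grouped together in both files. If that matters, one possible approach is to pair each cluster with the most similar cluster in the other file, ignoring the cluster numbers. The sketch below uses Jaccard similarity for this, which is a choice made here rather than something the question requires. It assumes each file has already been parsed into a mapping from cluster number to a set of dotted names (written out by hand here from the sample data), and `jaccard` and `match_clusters` are hypothetical helper names.

```python
def jaccard(a, b):
    """Overlap between two sets: 1.0 for identical sets, 0.0 for disjoint ones."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def match_clusters(clusters1, clusters2):
    """Pair each cluster in clusters1 with its most similar cluster in clusters2."""
    report = []
    for label1, members1 in sorted(clusters1.items()):
        best_label, best_score = None, 0.0
        for label2, members2 in sorted(clusters2.items()):
            score = jaccard(members1, members2)
            if score > best_score:
                best_label, best_score = label2, score
        matched = clusters2.get(best_label, set())
        report.append({
            "file1": label1,
            "file2": best_label,  # None when no cluster overlaps at all
            "similarity": best_score,
            "only_in_file1": members1 - matched,
            "only_in_file2": matched - members1,
        })
    return report


# Per-cluster contents of the sample file1 and file2, keyed by cluster number.
clusters1 = {
    0: {"giant.red.brick", "giant.red.apple"},
    1: {"tiny.green.dot", "tiny.blue.flower", "tiny.blue.candy"},
}
clusters2 = {
    18: {"giant.red.brick", "giant.red.tomato"},
    19: {"tiny.blue.flower", "tiny.blue.candy"},
}

for row in match_clusters(clusters1, clusters2):
    print(row)
```

On the sample data this pairs Cluster 0 with Cluster 18 and Cluster 1 with Cluster 19. The matching is greedy and not one-to-one: two clusters in the first file can map to the same cluster in the second, and clusters in the second file that nothing maps to are not reported, so a real comparison may also need to run it in the other direction.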

Archived:	12 years, 7 months ago
Viewed:	1,686 times
Last activity:	6 years, 3 months ago