哪种编程结构用于聚类算法

And*_*rej 12 python cluster-analysis hierarchical-clustering data-structures

我正在尝试实现以下(分裂)聚类算法(下面是算法的简短形式,完整描述可在此处获得):

从样本x开始,i = 1,...,n被视为n个数据点的单个簇和针对所有点对定义的相异度矩阵D. 修复阈值T以决定是否拆分集群.

  1. 首先确定所有数据点对之间的距离,并选择它们之间具有最大距离(Dmax)的对.

  2. 将Dmax与T进行比较.如果Dmax> T,则使用所选对作为两个新簇中的第一个元素,将单个簇分成两个.剩下的n - 2个数据点被放入两个新集群中的一个.如果D(x_i,x_l)<D(x_j,x_l),则将x_l添加到包含x_i的新集群中,否则将其添加到包含x_i的新集群中.

  3. 在第二阶段,在两个新簇之一中找到值D(x_i,x_j)以找到簇中具有它们之间的最大距离Dmax的对.如果Dmax <T,则群集的划分停止,并考虑另一个群集.然后,对从该迭代生成的聚类重复该过程.

输出是群集数据记录的层次结构.我恳请一下如何实现聚类算法的建议.

编辑1:我附加了定义距离(相关系数)的Python函数和在数据矩阵中找到最大距离的函数.

# Read data from GitHub
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/nico/collectiveintelligence-book/master/blogdata.txt', sep = '\t', index_col = 0)
data = df.values.tolist()
data = data[1:10]

# Define correlation coefficient as distance of choice
def pearson(v1, v2):
  # Simple sums
  sum1 = sum(v1)
  sum2 = sum(v2)
  # Sums of the squares
  sum1Sq = sum([pow(v, 2) for v in v1])
  sum2Sq = sum([pow(v, 2) for v in v2]) 
  # Sum of the products
  pSum=sum([v1[i] * v2[i] for i in range(len(v1))])
  # Calculate r (Pearson score)
  num = pSum - (sum1 * sum2 / len(v1))
  den = sqrt((sum1Sq - pow(sum1,2) / len(v1)) * (sum2Sq - pow(sum2, 2) / len(v1)))
  if den == 0: return 0
  return num / den


# Find largest distance
dist={}
max_dist = pearson(data[0], data[0])
# Loop over upper triangle of data matrix
for i in range(len(data)):
  for j in range(i + 1, len(data)):
     # Compute distance for each pair
     dist_curr = pearson(data[i], data[j])
     # Store distance in dict
     dist[(i, j)] = dist_curr
     # Store max distance
     if dist_curr > max_dist:
       max_dist = dist_curr
Run Code Online (Sandbox Code Playgroud)

编辑2:下面粘贴的是Dschoni答案的功能.

# Euclidean distance
def euclidean(x,y):
  x = numpy.array(x)
  y = numpy.array(y) 
  return numpy.sqrt(numpy.sum((x-y)**2))

# Create matrix
def dist_mat(data):
  dist = {}
  for i in range(len(data)):
    for j in range(i + 1, len(data)):
      dist[(i, j)] = euclidean(data[i], data[j])
  return dist


# Returns i & k for max distance
def my_max(dict):
    return max(dict)

# Sort function
list1 = []
list2 = []
def sort (rcd, i, k):
  list1.append(i)
  list2.append(k)
  for j in range(len(rcd)):
    if (euclidean(rcd[j], rcd[i]) < euclidean(rcd[j], rcd[k])):
      list1.append(j)
    else:
      list2.append(j)
Run Code Online (Sandbox Code Playgroud)

编辑3: 当我运行@Dschoni提供的代码时,算法按预期工作.然后我修改了create_distance_list函数,以便计算多变量数据点之间的距离.我使用欧氏距离.对于玩具示例,我加载iris数据.我只聚集数据集的前50个实例.

import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header = None, sep = ',')
df = df.drop(4, 1)
df = df[1:50]
data = df.values.tolist()

idl=range(len(data))
dist = create_distance_list(data)
print sort(dist, idl)
Run Code Online (Sandbox Code Playgroud)

结果如下:

[[24],[17],[4],[7],[40],[13],[14],[15],[26,27,38],[3,16,39],[ 25],[42],[18,20,45],[43],[1,2,11,46],[12,37,41],[5],[21],[22],[ 10,23,28,29],[6,34,48],[0,8,33,36,44],[31],[32],[19],[30],[35],[ 9,47]]

一些数据点仍然聚集在一起.我通过actualsort函数中向字典添加少量数据噪声来解决这个问题:

# Add small random noise
for key in actual:    
  actual[key] +=  np.random.normal(0, 0.005)
Run Code Online (Sandbox Code Playgroud)

知道如何正确解决这个问题吗?

Dsc*_*oni 5

欧几里德距离的正确工作示例:

import numpy as np
#For random number generation


def create_distance_list(l):
'''Create a distance list for every
unique tuple of pairs'''
    dist={}
    for i in range(len(l)):
        for k in range(i+1,len(l)):
            dist[(i,k)]=abs(l[i]-l[k])
    return dist

def maximum(distance_dict):
'''Returns the key of the maximum value if unique
or a random key with the maximum value.'''
    maximum = max(distance_dict.values())
    max_key = [key for key, value in distance_dict.items() if value == maximum]
    if len(max_key)>1:
        random_key = np.random.random_integers(0,len(max_key)-1)
        return (max_key[random_key],)
    else:
        return max_key

def construct_new_dict(distance_dict,index_list):
'''Helper function to create a distance map for a subset
of data points.'''
    new={}
    for i in range(len(index_list)):
        for k in range(i+1,len(index_list)):
            m = index_list[i]
            n = index_list[k]
            new[(m,n)]=distance_dict[(m,n)]
    return new

def sort(distance_dict,idl,threshold=4):
    result=[idl]
    i=0
    try:
        while True:
            if len(result[i])>=2:
            actual=construct_new_dict(dist,result[i]) 
                act_max=maximum(actual)
                if distance_dict[act_max[0]]>threshold:
                    j = act_max[0][0]
                    k = act_max[0][1]
                    result[i].remove(j)
                    result[i].remove(k)
                    l1=[j]
                    l2=[k]
                    for iterr in range(len(result[i])):
                        s = result[i][iterr]
                        if s>j:
                            c1=(j,s)
                        else:
                            c1=(s,j)
                        if s>k:
                            c2=(k,s)
                        else: 
                            c2=(s,k)
                        if actual[c1]<actual[c2]:
                            l1.append(s)
                        else:
                            l2.append(s)
                    result.remove(result[i])
    #What to do if distance is equal?
                    l1.sort()
                    l2.sort()
                    result.append(l1)
                    result.append(l2)
                else:
                    i+=1
            else:
                i+=1
    except:
        return result


#This is the dataset
a = [1,2,2.5,5]
#Giving each entry a unique ID
idl=range(len(a))
dist = create_distance_list(a)
print sort(dist,idl)  
Run Code Online (Sandbox Code Playgroud)

我编写了可读性代码,有很多东西可以更快,更可靠,更漂亮.这只是为了让您了解如何完成它.