Clustering with a custom distance metric in sklearn

use*_*922 5 python cluster-analysis python-3.x scikit-learn

I am trying to implement a custom distance metric for clustering. The code snippet looks like this:

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MeanShift

def distance(x, y):
    # print(x, y) -> here x and y are not both one-hot vectors, which is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(., .) counts the positions where both xi and yi are 1
    return distance(x, y)


vectorized_text = np.stack([[1, 0, 0, 1] * 100,
                            [1, 1, 1, 0] * 100,
                            [0, 1, 1, 0] * 100,
                            [0, 0, 0, 1] * 100] * 100)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(vectorized_text)

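vectorized_text is a one-hot encoded feature matrix of size n_sample x n_features. But when custom_metric is called, one of x or y turns into a real-valued vector while the other remains a one-hot vector. I expected both x and y to be one-hot vectors. As a result, custom_metric returns wrong values at run time, and the clustering is incorrect.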

Example of x and y inside distance(x, y):

x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]

Both should be one-hot vectors.

Does anyone have an idea how to resolve this?

PV8*_*PV8 1

I don't understand your question. If I have:

x = [1, 0, 1]
y = [0, 0, 1]

and I use:

def distance(x, y):
    # print(x, y) -> This x and y aren't one-hot vectors and is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

print(distance(x, y))
 1.0

If I now print x and y at the top of the function:

x
[1, 0, 1]
y
[0, 0, 1]

So it works?
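
The direct call above never goes through scikit-learn, so it cannot show the real-valued x from the question. A plausible explanation (an assumption, not something confirmed in this thread) is that a tree-based neighbor search evaluates the callable against internal node centroids, which are averages of rows rather than rows themselves. One way to guarantee the metric only sees actual rows is to build the pairwise matrix by hand and pass it to DBSCAN with metric="precomputed". A minimal sketch of that idea follows; the small matrix X, the helper pairwise_matrix, and the binary check inside custom_metric are illustrative and not part of the original code.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def custom_metric(x, y):
    # Illustrative check: in this route every argument is an actual 0/1 row.
    assert np.isin(x, (0, 1)).all() and np.isin(y, (0, 1)).all()
    # Same rule as distance() in the question: count positions where both are 1.
    return float(np.sum((x == 1) & (y == 1)))


def pairwise_matrix(X, metric):
    # Hypothetical helper: evaluate the metric on every pair of rows of X.
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            D[i, j] = D[j, i] = metric(X[i], X[j])
    return D


# Small stand-in for vectorized_text: four binary patterns, repeated 3 times.
X = np.array([[1, 0, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]] * 3)

D = pairwise_matrix(X, custom_metric)

# DBSCAN reads the entries of D as distances and never calls custom_metric itself.
labels = DBSCAN(min_samples=2, eps=3, metric="precomputed").fit_predict(D)
print(D[:4, :4])
print(labels)
```

Note that the value returned here is a match count, where larger means more alike, while DBSCAN treats the entries as distances. The sketch leaves that unchanged, so the count may need converting into a dissimilarity before the resulting clusters are meaningful.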