Clustering with a custom distance metric in sklearn

Question

use*_*922 5 python cluster-analysis python-3.x scikit-learn

I am trying to implement a custom distance metric for clustering. The code snippet looks like this:

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MeanShift

def distance(x, y):
    # print(x, y) shows that x and y are not both one-hot vectors; this is the source of the question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x and y are two feature vectors
    # distance(., .) counts the positions where both xi and yi are 1
    return distance(x, y)


vectorized_text = np.stack([[1, 0, 0, 1] * 100,
                            [1, 1, 1, 0] * 100,
                            [0, 1, 1, 0] * 100,
                            [0, 0, 0, 1] * 100] * 100)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(vectorized_text)

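vectorized_text is a one-hot encoded feature matrix of size n_sample x n_features. But when custom_metric is called, one of x or y turns out to be a real-valued vector while the other is still a one-hot vector. I expected both x and y to be one-hot vectors. Because of this, custom_metric returns wrong results at run time, and the clustering is incorrect.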

Example of x and y inside distance(x, y):

x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]

Both should be one-hot vectors.
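One possible explanation, stated here as an assumption that the thread does not confirm: with the default algorithm="auto", scikit-learn may build a ball tree for a callable metric, and a ball tree evaluates the metric between data points and node centroids, which are means of rows rather than rows of the input. The minimal sketch below only shows that the mean of the question's data is exactly the all-0.5 vector reported above; it does not inspect scikit-learn internals.

```python
import numpy as np

# Same toy data as in the question: 400 binary rows with 400 features each.
vectorized_text = np.stack([[1, 0, 0, 1] * 100,
                            [1, 1, 1, 0] * 100,
                            [0, 1, 1, 0] * 100,
                            [0, 0, 0, 1] * 100] * 100)

# Column-wise mean over all rows. A tree node covering the whole data set
# would have this vector as its centroid (assumption about the tree, see above).
centroid = vectorized_text.mean(axis=0)

# Each column holds 1 in exactly half of the rows, so every entry is 0.5.
print(centroid[:8])
```

If this is what happens, the real-valued vector is not a corrupted sample but an intermediate point created by the neighbor search.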

Does anyone have an idea how to resolve this?

Answer 1

PV8*_*PV8 1

I don't understand your question. If I have:

x = [1, 0, 1]
y = [0, 0, 1]

and I use:

def distance(x, y):
    # print(x, y) shows that x and y are not both one-hot vectors; this is the source of the question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

print(distance(x, y))
 1.0

If I now print x and y at the top of the function:

x
[1, 0, 1]
y
[0, 0, 1]

So it works?
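Calling distance(x, y) directly, as above, always receives the vectors it is given; the behaviour in the question only shows up when DBSCAN drives the metric. A minimal sketch of one way to keep scikit-learn from handing the metric anything but input rows is to compute the pairwise matrix up front and pass metric="precomputed". The name match_counts is hypothetical, the dot-product shortcut assumes strictly 0/1 data, and eps and min_samples are copied from the question without checking that they suit this data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same toy data as in the question: 400 binary rows with 400 features each.
X = np.stack([[1, 0, 0, 1] * 100,
              [1, 1, 1, 0] * 100,
              [0, 1, 1, 0] * 100,
              [0, 0, 0, 1] * 100] * 100)

# For 0/1 data, the dot product of two rows counts the positions where both
# are 1, which is the value the question's distance(x, y) returns.
match_counts = (X @ X.T).astype(float)

# With metric="precomputed", DBSCAN reads pairwise values from the matrix and
# never calls a metric function, so no intermediate vectors are involved.
dbscan = DBSCAN(eps=3, min_samples=2, metric="precomputed").fit(match_counts)
labels = dbscan.labels_
```

Passing algorithm="brute" together with the callable metric is another option that avoids building a tree, at the cost of one Python call per pair of rows. Note also that a match count grows as two vectors become more alike, so it behaves like a similarity; whether to convert it into a dissimilarity before clustering is a modelling decision the thread leaves open.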
