use*_*922 5 python cluster-analysis python-3.x scikit-learn
I am trying to implement a custom distance metric for clustering. The code snippet looks like this:
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MeanShift

def distance(x, y):
    # print(x, y) -> these x and y aren't one-hot vectors, which is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(.,.) counts the elements where both xi and yi are True
    return distance(x, y)

vectorized_text = np.stack([[1, 0, 0, 1] * 100,
                            [1, 1, 1, 0] * 100,
                            [0, 1, 1, 0] * 100,
                            [0, 0, 0, 1] * 100] * 100)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(vectorized_text)
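vectorized_text is a one-hot encoded feature matrix of size n_samples x n_features. However, when custom_metric is called, one of x or y turns into a real-valued vector while the other is still a one-hot vector. I expected both x and y to be one-hot vectors. As a result, custom_metric returns wrong values at run time and the clustering is incorrect.
Example of x and y inside distance(x, y):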
x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]
Both of them should be one-hot vectors.
Does anyone have an idea how to resolve this?
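One way to narrow this down is to record what the metric actually receives under different neighbor-search settings. The sketch below is only an illustration under stated assumptions: `match_count` is a NumPy stand-in for `distance(x, y)`, `make_recording_metric` is a hypothetical helper, the data is a small 8 x 8 stand-in for `vectorized_text`, and the two settings compared are the `algorithm="brute"` and `algorithm="ball_tree"` options of `DBSCAN`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def match_count(x, y):
    # Stand-in for distance(x, y): number of positions where both entries are 1.
    return float(np.sum((x == 1) & (y == 1)))

def make_recording_metric(log):
    # Hypothetical helper: wraps the metric and records, per call,
    # whether both inputs contain only 0s and 1s.
    def metric(x, y):
        log.append(bool(np.isin(x, (0, 1)).all() and np.isin(y, (0, 1)).all()))
        return match_count(x, y)
    return metric

# Small stand-in for vectorized_text (assumed shape: 8 samples x 8 features).
X = np.array([[1, 0, 0, 1] * 2,
              [1, 1, 1, 0] * 2,
              [0, 1, 1, 0] * 2,
              [0, 0, 0, 1] * 2] * 2)

results = {}
for algorithm in ("brute", "ball_tree"):
    log = []
    DBSCAN(min_samples=2, eps=3, metric=make_recording_metric(log),
           algorithm=algorithm).fit(X)
    results[algorithm] = all(log)
    print(algorithm, "-> metric saw only 0/1 vectors:", results[algorithm])
```

With `algorithm="brute"` the metric is evaluated on pairs of rows only. A tree-based search may also evaluate it against points that are not rows of the data (for example node centroids, which are averages of rows), and that would be consistent with the `0.5` entries shown above. Whether this is what happens in a given scikit-learn version is best confirmed by running the check.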
I don't understand your question. If I have:
x = [1, 0, 1]
y = [0, 0, 1]
and I use:
def distance(x, y):
    # print(x, y) -> these x and y aren't one-hot vectors, which is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

print(distance(x, y))
1.0
If I now print x, y at the top of the function, I get:
x
[1, 0, 1]
y
[0, 0, 1]
So it works?
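A direct call like this only ever sees the two vectors passed in, so it cannot show what the clustering code passes to the metric internally. One way to make the clustering run depend only on such direct calls is to evaluate the function on every pair of rows first and hand the resulting matrix to `DBSCAN` with `metric="precomputed"`. The following is a minimal sketch, not a definitive fix: `match_count` is an assumed NumPy equivalent of `distance(x, y)`, and the data is a small stand-in for the question's matrix.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def match_count(x, y):
    # Assumed equivalent of distance(x, y): positions where both entries are 1.
    return float(np.sum((x == 1) & (y == 1)))

# Small stand-in for vectorized_text (assumed shape: 8 samples x 8 features).
X = np.array([[1, 0, 0, 1] * 2,
              [1, 1, 1, 0] * 2,
              [0, 1, 1, 0] * 2,
              [0, 0, 0, 1] * 2] * 2)

# Evaluate the function on every pair of actual rows, so it never
# receives anything other than rows of X.
pairwise = np.array([[match_count(a, b) for b in X] for a in X])

# metric="precomputed" tells DBSCAN to treat the input as pairwise values
# instead of feature vectors.
dbscan = DBSCAN(min_samples=2, eps=3, metric="precomputed").fit(pairwise)
labels = dbscan.labels_
print(labels)
```

Note that the match count grows as two vectors become more alike, while `eps` is a threshold on how far apart neighbors may be, so the values may need to be turned into a dissimilarity before the clusters are meaningful. For large inputs the full n_samples x n_samples matrix can also be expensive to build and store.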