How can I implement the PAM clustering algorithm with Gower distance in sklearn?

Csa*_*ton 1 cluster-analysis python-3.x scikit-learn

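I want to implement the PAM algorithm (KMedoids, method='pam') using Gower distance.

My dataset contains mixed features, both numeric and categorical, and some of the categorical features have more than 1000 distinct values.

I found a suitable Gower distance implementation here: https://github.com/wwwjk366/gower/blob/master/gower/gower_dist.py

My problem is that the sklearn-extra implementation of PAM that I am using has no metric='gower' option. So I tried to write a callable instead, but I am finding it hard to wire the two together.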

# cat_mask is a boolean list marking which columns of df_ext are categorical
D = gower.gower_matrix(df_ext, cat_features=cat_mask)

# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html
def get_gower():
    return sklearn.metrics.pairwise_distances(D, metric='precomputed')

# https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
kmedoids = sklearn_extra.cluster.KMedoids(df_ext, metric=get_gower, method='pam')
kmedoids.fit(df_ext)

I get this ValueError:

ValueError                                Traceback (most recent call last)
<ipython-input-13-9ae677cd636a> in <module>
      1 # https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
      2 kmedoids = KMedoids(df_ext, metric=get_gower, method='pam')
----> 3 kmedoids.fit(df_ext)

D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in fit(self, X, y)
    183         random_state_ = check_random_state(self.random_state)
    184 
--> 185         self._check_init_args()
    186         X = check_array(X, accept_sparse=["csr", "csc"])
    187         if self.n_clusters > X.shape[0]:

D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in _check_init_args(self)
    154 
    155         # Check n_clusters and max_iter
--> 156         self._check_nonnegative_int(self.n_clusters, "n_clusters")
    157         self._check_nonnegative_int(self.max_iter, "max_iter", False)
    158 

D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in _check_nonnegative_int(self, value, desc, strict)
    144         else:
    145             negative = (value is None) or (value < 0)
--> 146         if negative or not isinstance(value, (int, np.integer)):
    147             raise ValueError(
    148                 "%s should be a nonnegative integer. "

D:\ProgramFiles\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1327 
   1328     def __nonzero__(self):
-> 1329         raise ValueError(
   1330             f"The truth value of a {type(self).__name__} is ambiguous. "
   1331             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I think something is wrong with my callable. Do you know what I am doing wrong?

ari*_*ics 6

K-medoids (PAM) with the Gower metric in Python

  • Data types: numeric and categorical variables
  • Results compared against R
  • Note: consider scaling the numeric data before clustering.
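As a rough illustration of what the Gower metric does with the two data types listed above, here is a minimal sketch of the textbook definition: each numeric feature contributes the absolute difference divided by that column's range, each categorical feature contributes 0 on a match and 1 on a mismatch, and the per-feature values are averaged. The function name `gower_distance_matrix` and the three toy records are illustrative only and are not part of the `gower` package, which may differ in details such as missing-value handling and feature weights. It is written in plain Python so it runs without extra packages.

```python
def gower_distance_matrix(rows, is_categorical):
    """Pairwise Gower distances for a list of equal-length records (sketch)."""
    n_features = len(is_categorical)

    # Range (max - min) of every numeric column; None marks categorical columns.
    ranges = []
    for j, is_cat in enumerate(is_categorical):
        if is_cat:
            ranges.append(None)
        else:
            column = [row[j] for row in rows]
            ranges.append(max(column) - min(column))

    def pair_distance(a, b):
        total = 0.0
        for j, is_cat in enumerate(is_categorical):
            if is_cat:
                # Categorical: simple mismatch indicator.
                total += 0.0 if a[j] == b[j] else 1.0
            elif ranges[j] > 0:
                # Numeric: absolute difference scaled by the column range.
                total += abs(a[j] - b[j]) / ranges[j]
            # A constant numeric column (range 0) contributes nothing.
        return total / n_features

    return [[pair_distance(a, b) for b in rows] for a in rows]


# Toy records: (age, gender, civil_status)
rows = [
    (21, "M", "MARRIED"),
    (21, "M", "SINGLE"),
    (39, "F", "SINGLE"),
]
D = gower_distance_matrix(rows, is_categorical=[False, True, True])
for line in D:
    print([round(value, 3) for value in line])
```

In this formulation every feature contributes a value between 0 and 1, so the resulting distance also lies between 0 and 1.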
import pandas as pd 
import numpy as np
import gower
from sklearn.preprocessing import LabelEncoder
from sklearn_extra.cluster import KMedoids

# Create a dataframe with both numeric and string type columns 

age = [21, 21, 19, 30, 21, 21, 19, 30, 35, 39, 50, 2]
gender = ['M', 'M', 'N', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M']
civil_status = ['MARRIED', 'SINGLE', 'SINGLE', 'SINGLE', 'MARRIED', 'SINGLE', 'WIDOW', 'DIVORCED', 'WIDOW', 'MARRIED', 'WIDOW', 'MARRIED']
salary = [3000.0, 1200.0 , 32000.0, 1800.0 , 2900.0 , 1100.0 , 10000.0, 1500.0, 200.0, 500.0, 50.0, 5000.0]
available_credit = [2200, 100, 22000, 1100, 2000, 100, 6000, 2200, 6000, 12000, 500, 50]

df_eg = pd.DataFrame({'age': age,
                 'gender': gender,
                  'civil_status': civil_status,
                 'salary': salary,
                 'available_credit': available_credit})
# Label encode categorical variables.
# Note: the encoded copy is not used below; gower_matrix is called on df_eg itself.

df_eg_encoded = df_eg.copy()  # Copy so df_eg keeps its original string columns
df_eg_encoded[['gender', 'civil_status']] = df_eg_encoded[['gender', 'civil_status']].apply(LabelEncoder().fit_transform)

# Apply Gower distance calculation; cat_features flags the categorical columns

gower_mat = gower.gower_matrix(df_eg, cat_features=[False, True, True, False, False])

# Fit model on the precomputed distance matrix

km_model = KMedoids(n_clusters=3, random_state=0, metric='precomputed', method='pam', init='k-medoids++').fit(gower_mat)

clusters = km_model.labels_
clusters
# Output: array([1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1], dtype=int64)
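Compared with the snippet in the question, the KMedoids call above differs in two ways. First, n_clusters is given explicitly; the traceback in the question fails inside the n_clusters check, which is consistent with df_ext having been bound to n_clusters as the first positional argument. Second, with metric='precomputed' the Gower matrix itself is passed to fit, so no callable is needed. To illustrate why a distance matrix is all that PAM requires, here is a minimal plain-Python sketch of the classic BUILD and SWAP phases. It is not the scikit-learn-extra implementation (which, for example, is called above with init='k-medoids++'); the function name `pam_precomputed` and the toy one-dimensional data are illustrative assumptions.

```python
def pam_precomputed(D, k, max_iter=100):
    """Greedy BUILD followed by SWAP on a precomputed distance matrix D (sketch)."""
    n = len(D)

    def cost(medoids):
        # Total distance from every point to its closest medoid.
        return sum(min(D[i][m] for m in medoids) for i in range(n))

    # BUILD: add, one at a time, the point that lowers the total cost most.
    medoids = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda c: cost(medoids + [c]))
        medoids.append(best)

    # SWAP: apply the best medoid/non-medoid exchange until none improves.
    for _ in range(max_iter):
        best_cost, best_swap = cost(medoids), None
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = cost(trial)
                if trial_cost < best_cost:
                    best_cost, best_swap = trial_cost, trial
        if best_swap is None:
            break
        medoids = best_swap

    # Label each point with the position of its nearest medoid.
    labels = [min(range(k), key=lambda c: D[i][medoids[c]]) for i in range(n)]
    return medoids, labels


# Toy data: two well-separated groups on a line; D holds absolute differences.
points = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in points] for a in points]
medoids, labels = pam_precomputed(D, k=2)
print(medoids, labels)  # [1, 4] [0, 0, 0, 1, 1, 1]
```

The sketch never looks at the original features, only at D, which is why any dissimilarity, Gower included, can be plugged in through a precomputed matrix.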

R code

install.packages("cluster")
age <- c(21,21,19, 30,21,21,19,30, 35, 39, 50, 2)
gender <- c('M','M','N','M','F','F','F','F', 'F', 'M', 'F', 'M')
civil_status <- c('MARRIED','SINGLE','SINGLE','SINGLE','MARRIED','SINGLE','WIDOW','DIVORCED', 'WIDOW', 'MARRIED', 'WIDOW', 'MARRIED')
salary <- c(3000.0,1200.0 ,32000.0,1800.0 ,2900.0 ,1100.0 ,10000.0,1500.0, 200.0, 500.0, 50.0, 5000.0)
available_credit <- c(2200,100,22000,1100,2000,100,6000,2200, 6000, 12000, 500, 50)
X <- data.frame(age, gender, civil_status, salary, available_credit)
print(X)

library(cluster)
gower_mat <- daisy(X, metric = c("gower"))
pamx <- pam(gower_mat, 3)
print(pamx)
# Output (excerpt):
# Clustering vector:
# [1] 1 1 2 1 1 3 3 3 3 1 3 1
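The Python labels and the R clustering vector use different cluster numbers (0, 1, 2 versus 1, 2, 3), so they cannot be compared element by element. A small sketch for checking that two labelings describe the same grouping up to renaming follows; the helper name `same_partition` is illustrative, and the two lists are copied from the outputs shown above.

```python
def same_partition(labels_a, labels_b):
    """True if both labelings group the points identically, ignoring label names."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # setdefault stores the first pairing seen and returns it on later visits,
        # so any conflicting pairing shows up as a mismatch.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


python_labels = [1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1]  # km_model.labels_ from the Python run
r_labels = [1, 1, 2, 1, 1, 3, 3, 3, 3, 1, 3, 1]       # clustering vector from pam() in R
print(same_partition(python_labels, r_labels))  # True
```

Here Python's cluster 0 corresponds to R's cluster 3, and clusters 1 and 2 keep their numbers.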

References

https://pypi.org/project/gower/
https://scikit-learn-extra.readthedocs.io/en/stable/generated/sklearn_extra.cluster.KMedoids.html
https://www.rdocumentation.org/packages/cluster/versions/2.1.2/topics/daisy
https://www.rdocumentation.org/packages/cluster/versions/2.1.2/topics/pam