Csa*_*ton 1 cluster-analysis python-3.x scikit-learn
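I want to implement the PAM algorithm (KMedoids, method='pam') using Gower distance.
My dataset contains mixed features, both numeric and categorical, and some of the categorical features have more than 1,000 distinct values.
I found a suitable Gower distance implementation here: https://github.com/wwwjk366/gower/blob/master/gower/gower_dist.py
My problem is that the scikit-learn-extra implementation of PAM that I am using has no metric='gower' option. So I tried to create a callable, but I am finding it hard to connect the two.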
# cat_mask is a boolean list marking which features in df_ext are categorical
D = gower.gower_matrix(df_ext, cat_features=cat_mask)
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html
def get_gower():
    return sklearn.metrics.pairwise_distances(D, metric='precomputed')
# https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
kmedoids = sklearn_extra.cluster.KMedoids(df_ext, metric=get_gower, method='pam')
kmedoids.fit(df_ext)
I get this ValueError:
ValueError Traceback (most recent call last)
<ipython-input-13-9ae677cd636a> in <module>
1 # https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
2 kmedoids = KMedoids(df_ext, metric=get_gower, method='pam')
----> 3 kmedoids.fit(df_ext)
D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in fit(self, X, y)
183 random_state_ = check_random_state(self.random_state)
184
--> 185 self._check_init_args()
186 X = check_array(X, accept_sparse=["csr", "csc"])
187 if self.n_clusters > X.shape[0]:
D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in _check_init_args(self)
154
155 # Check n_clusters and max_iter
--> 156 self._check_nonnegative_int(self.n_clusters, "n_clusters")
157 self._check_nonnegative_int(self.max_iter, "max_iter", False)
158
D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in _check_nonnegative_int(self, value, desc, strict)
144 else:
145 negative = (value is None) or (value < 0)
--> 146 if negative or not isinstance(value, (int, np.integer)):
147 raise ValueError(
148 "%s should be a nonnegative integer. "
D:\ProgramFiles\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1327
1328 def __nonzero__(self):
-> 1329 raise ValueError(
1330 f"The truth value of a {type(self).__name__} is ambiguous. "
1331 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I think something is wrong with my callable. Do you know what I am doing wrong?
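The traceback stops in the n_clusters check, not in the callable: df_ext was passed as the first positional argument of KMedoids, and that argument is n_clusters. The following is a minimal sketch of why the check then raises. AmbiguousTruth is a hypothetical stand-in class (it is not part of pandas) that only mimics how a DataFrame reacts to comparison and truth testing, and check_nonnegative_int is a simplified imitation of the two lines shown at 145-146 in the traceback.

```python
class AmbiguousTruth:
    """Hypothetical stand-in for a pandas DataFrame (illustration only)."""

    def __lt__(self, other):
        # Like a DataFrame, a comparison returns an object, not a single bool.
        return self

    def __bool__(self):
        # Like a DataFrame, the object refuses to collapse to True or False.
        raise ValueError("The truth value of a DataFrame is ambiguous.")


def check_nonnegative_int(value, desc):
    # Simplified imitation of the check shown in the traceback.
    negative = (value is None) or (value < 0)
    if negative or not isinstance(value, int):
        raise ValueError(f"{desc} should be a nonnegative integer.")


# An ordinary integer passes the check silently.
check_nonnegative_int(3, "n_clusters")

# A DataFrame-like object fails as soon as `negative` is truth-tested.
try:
    check_nonnegative_int(AmbiguousTruth(), "n_clusters")
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Read this way, the fix would be to pass an integer as n_clusters, set metric='precomputed', and hand the Gower matrix D to fit; the get_gower callable would then not be needed.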
import pandas as pd
import numpy as np
import gower
from sklearn.preprocessing import LabelEncoder
from sklearn_extra.cluster import KMedoids
# Create a dataframe with both numeric and string type columns
age = [21, 21, 19, 30, 21, 21, 19, 30, 35, 39, 50, 2]
gender = ['M', 'M', 'N', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M']
civil_status = ['MARRIED', 'SINGLE', 'SINGLE', 'SINGLE', 'MARRIED', 'SINGLE', 'WIDOW', 'DIVORCED', 'WIDOW', 'MARRIED', 'WIDOW', 'MARRIED']
salary = [3000.0, 1200.0 , 32000.0, 1800.0 , 2900.0 , 1100.0 , 10000.0, 1500.0, 200.0, 500.0, 50.0, 5000.0]
available_credit = [2200, 100, 22000, 1100, 2000, 100, 6000, 2200, 6000, 12000, 500, 50]
df_eg = pd.DataFrame({'age': age,
'gender': gender,
'civil_status': civil_status,
'salary': salary,
'available_credit': available_credit})
# Label encode categorical variables
df_eg_encoded = df_eg.copy() # Avoid Pandas error
df_eg_encoded[['gender', 'civil_status']] = df_eg_encoded[['gender', 'civil_status']].apply(LabelEncoder().fit_transform)
# Apply Gower distance calculation
gower_mat = gower.gower_matrix(df_eg, cat_features = [False, True, True, False, False])
# Fit model
km_model = KMedoids(n_clusters = 3, random_state = 0, metric = 'precomputed', method = 'pam', init = 'k-medoids++').fit(gower_mat)
clusters = km_model.labels_
clusters
> array([1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1], dtype=int64)
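For readers who want to see what a Gower matrix contains before it is handed to KMedoids as a precomputed metric, here is a rough pure-Python sketch of the textbook definition: range-normalised absolute difference for numeric features, a 0/1 mismatch for categorical features, averaged over all features. The function name gower_matrix_sketch is hypothetical, the rows are the first four records of the example above, and the gower package may differ in details such as missing-value handling, so treat it as an illustration, not as that library's implementation.

```python
def gower_matrix_sketch(rows, cat_features):
    """Pairwise Gower distances for a list of records (illustrative only)."""
    n_features = len(cat_features)

    # Range (max - min) of every numeric column; None for categorical ones.
    ranges = []
    for k in range(n_features):
        if cat_features[k]:
            ranges.append(None)
        else:
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))

    def distance(a, b):
        total = 0.0
        for k in range(n_features):
            if cat_features[k]:
                # Categorical: 0 when the values match, 1 otherwise.
                total += 0.0 if a[k] == b[k] else 1.0
            elif ranges[k] > 0:
                # Numeric: absolute difference scaled by the column range.
                total += abs(a[k] - b[k]) / ranges[k]
        return total / n_features

    return [[distance(a, b) for b in rows] for a in rows]


# First four records of the example: age, gender, civil_status, salary, credit.
rows = [
    (21, "M", "MARRIED", 3000.0, 2200),
    (21, "M", "SINGLE", 1200.0, 100),
    (19, "N", "SINGLE", 32000.0, 22000),
    (30, "M", "SINGLE", 1800.0, 1100),
]
cat_mask = [False, True, True, False, False]

D = gower_matrix_sketch(rows, cat_mask)
for line in D:
    print([round(value, 4) for value in line])
```

The result is a symmetric matrix with zeros on the diagonal and entries between 0 and 1, which is the shape of input that metric='precomputed' expects.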
install.packages("cluster")
age <- c(21,21,19, 30,21,21,19,30, 35, 39, 50, 2)
gender <- c('M','M','N','M','F','F','F','F', 'F', 'M', 'F', 'M')
civil_status <- c('MARRIED','SINGLE','SINGLE','SINGLE','MARRIED','SINGLE','WIDOW','DIVORCED', 'WIDOW', 'MARRIED', 'WIDOW', 'MARRIED')
salary <- c(3000.0,1200.0 ,32000.0,1800.0 ,2900.0 ,1100.0 ,10000.0,1500.0, 200.0, 500.0, 50.0, 5000.0)
available_credit <- c(2200,100,22000,1100,2000,100,6000,2200, 6000, 12000, 500, 50)
X <- data.frame(age, gender, civil_status, salary, available_credit)
print(X)
library(cluster)
gower_mat <- daisy(X, metric = c("gower"))
pamx <- pam(gower_mat, 3)
print(pamx)
> Clustering vector:
> [1] 1 1 2 1 1 3 3 3 3 1 3 1
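The Python and R label vectors use different cluster numbers (0-based versus 1-based, and in a different order), so they cannot be compared element by element. A small sketch of one way to check that two labelings describe the same grouping follows; same_partition is a hypothetical helper written for this illustration, and the two lists are copied from the outputs shown above.

```python
def same_partition(labels_a, labels_b):
    """Return True if the two labelings group the samples identically."""
    if len(labels_a) != len(labels_b):
        return False
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one vector must map to exactly one label in the other.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


python_labels = [1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1]  # KMedoids output above
r_labels = [1, 1, 2, 1, 1, 3, 3, 3, 3, 1, 3, 1]       # pam() output above

print(same_partition(python_labels, r_labels))  # prints True
```

Here the mapping is 1 to 1, 2 to 2 and 0 to 3, so both implementations produce the same three clusters on this example.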
https://pypi.org/project/gower/
https://scikit-learn-extra.readthedocs.io/en/stable/generated/sklearn_extra.cluster.KMedoids.html
https://www.rdocumentation.org/packages/cluster/versions/2.1.2/topics/daisy
https://www.rdocumentation.org/packages/cluster/versions/2.1.2/topics/pam