When I compute the silhouette score with Spark and with sklearn on the same data and the same predicted cluster labels, I get different results.
Here is the Spark code:
>>> prediction.show()
+---+---+---------+----------+
| a| b| features|prediction|
+---+---+---------+----------+
| 1| 1|[1.0,1.0]| 1|
| 2| 2|[2.0,2.0]| 1|
| 3| 3|[3.0,3.0]| 0|
| 4| 4|[4.0,4.0]| 0|
+---+---+---------+----------+
>>> from pyspark.ml.evaluation import ClusteringEvaluator
>>> evaluator = ClusteringEvaluator()
>>> silhouette = evaluator.evaluate(prediction)
>>> silhouette
0.7230769230769223
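One thing worth checking on the Spark side: ClusteringEvaluator computes the silhouette with a configurable distanceMeasure param whose default is "squaredEuclidean" (the only other supported value is "cosine", so there is no plain-Euclidean option). A quick check, assuming Spark 2.4+ where the param exists:
>>> evaluator.getDistanceMeasure()
'squaredEuclidean'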
Here is the sklearn code:
>>> from sklearn.cluster import KMeans
>>> from sklearn import metrics
>>> x=[[1,1],[2,2],[3,3],[4,4]]
>>> prediction = KMeans(n_clusters=2,max_iter=1000,random_state=123).fit_predict(x)
>>> prediction
array([1, 1, 0, 0], dtype=int32)
>>> silhouette = metrics.silhouette_score(x, prediction)
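The discrepancy is most likely the distance measure rather than the clustering itself: with squared Euclidean distances the silhouette of this data works out to 47/65 ≈ 0.7230769, matching what Spark reports, while the plain Euclidean silhouette is 7/15 ≈ 0.4667. As a sketch of how to confirm this, sklearn's silhouette_score forwards its metric argument to pairwise_distances, which also accepts the scipy metric name 'sqeuclidean':
>>> # Mimic Spark's default squaredEuclidean distance; this should
>>> # come out to roughly 0.7230769, in line with Spark's result.
>>> metrics.silhouette_score(x, prediction, metric='sqeuclidean')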