kau*_*mar 4 python cluster-analysis k-means apache-spark pyspark
I am experimenting with clustering models in pyspark. I am trying to get the mean squared cost of the cluster fit for different values of K:
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

def meanScore(k, df):
    inputCols = df.columns[:38]
    assembler = VectorAssembler(inputCols=inputCols, outputCol="features")
    kmeans = KMeans().setK(k)
    pipeModel2 = Pipeline(stages=[assembler, kmeans])
    kmeansModel = pipeModel2.fit(df).stages[-1]
    return kmeansModel.computeCost(assembler.transform(df)) / df.count()
When I try to call this function to compute the cost for different values of K on my dataframe:
for k in range(20, 100, 20):
    sc = meanScore(k, numericOnly)
    print((k, sc))
I get an attribute error: AttributeError: 'KMeansModel' object has no attribute 'computeCost'
I am fairly new to pyspark and just learning; I would sincerely appreciate any help with this. Thanks.
Dho*_*eb 6
As Erkan Sirin mentioned, computeCost was deprecated in recent versions (and removed in Spark 3.0). This may help you work around it:
from pyspark.ml.evaluation import ClusteringEvaluator

# Make predictions
predictions = model.transform(dataset)

# Evaluate clustering by computing the Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
Hope this helps; you can check the official documentation for more information.
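Putting the two pieces together, the question's meanScore loop can be adapted to the evaluator API. This is a minimal sketch, assuming numericOnly is the questioner's dataframe of numeric columns; note that ClusteringEvaluator returns a Silhouette score (higher is better), not the WSSSE cost that computeCost returned:

from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import VectorAssembler

def silhouetteScore(k, df):
    # Assemble the first 38 columns into a single feature vector,
    # mirroring the pipeline from the question
    assembler = VectorAssembler(inputCols=df.columns[:38], outputCol="features")
    kmeans = KMeans().setK(k)
    model = Pipeline(stages=[assembler, kmeans]).fit(df)
    predictions = model.transform(df)
    # Silhouette score replaces the removed computeCost metric
    return ClusteringEvaluator().evaluate(predictions)

for k in range(20, 100, 20):
    print((k, silhouetteScore(k, numericOnly)))

Alternatively, on Spark 3.0+ the training cost (sum of squared distances, the quantity computeCost used to return for the training data) is still available on the fitted model as model.stages[-1].summary.trainingCost.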