Tags: python, pyspark, spark-dataframe, apache-spark-mllib
I understand that in order to use the ml.clustering KMeans algorithm (actually any ml algorithm?) with DataFrames, my DataFrame needs to have a certain shape: (id, vector[]), or something similar. How do I apply the right transformation to turn a regular table (stored in a df) into the required structure? Here is my df:
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()
sc = SparkContext(conf=conf)
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
#-----------------------------
#creating DF:
l = [('user1', 2,1,4),('user2',3,5,6)]
temp_df = spark.createDataFrame(l)
temp_df.show()
+-----+---+---+---+
| _1| _2| _3| _4|
+-----+---+---+---+
|user1| 2| 1| 4|
|user2| 3| 5| 6|
+-----+---+---+---+
I want to use:
from pyspark.ml.clustering import KMeans
kmean = KMeans().setK(2).setSeed(1)
model = kmean.fit(temp_df)
I get: IllegalArgumentException: u'Field "features" does not exist.'
Thanks,