I am trying to run a Random Forest classification using the Spark ML API, but I am having trouble creating the right DataFrame input for the pipeline.
Here is the sample data:
age,hours_per_week,education,sex,salaryRange
38,40,"hs-grad","male","A"
28,40,"bachelors","female","A"
52,45,"hs-grad","male","B"
31,50,"masters","female","B"
42,40,"bachelors","male","B"
age and hours_per_week are integers, while the other features, including the label salaryRange, are categorical (String).
Loading this csv file (let's call it sample.csv) can be done with the Spark csv library as follows:
val data = sqlContext.csvFile("/home/dusan/sample.csv")
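(For reference, the same load can be written with the generic DataFrame reader instead of the `csvFile` shortcut; this sketch assumes spark-csv 1.x on Spark 1.4+ and the same file path as above.)

```scala
// Equivalent load via the generic reader API (spark-csv 1.x).
// "header" tells the reader the first line holds column names.
val data = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/home/dusan/sample.csv")
```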
By default all columns are imported as strings, so we need to change "age" and "hours_per_week" to Int:
val toInt = udf[Int, String]( _.toInt)
val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week",toInt(data("hours_per_week")))
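(A UDF is not strictly required here; the built-in `Column.cast` does the same conversion and lets Spark handle it natively. A minimal alternative sketch:)

```scala
// Same conversion without a UDF, using the built-in cast.
val dataFixed = data
  .withColumn("age", data("age").cast("int"))
  .withColumn("hours_per_week", data("hours_per_week").cast("int"))
```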
Just to check what the schema looks like now:
scala> dataFixed.printSchema
root
|-- age: integer (nullable = true)
|-- hours_per_week: integer (nullable = true)
|-- education: string (nullable = true)
|-- sex: string (nullable = true)
|-- salaryRange: string (nullable = true)
Then set up the cross validator and pipeline:
val rf = new RandomForestClassifier()
val pipeline …
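(The snippet above is truncated, but a typical way to feed this DataFrame into a RandomForestClassifier is to index the string columns with StringIndexer, assemble a "features" vector with VectorAssembler, and chain everything in a Pipeline inside a CrossValidator. The stage names and the small parameter grid below are illustrative assumptions, not from the original post; the sketch assumes Spark 1.5+ so that RandomForestClassifier emits the rawPrediction column the evaluator needs.)

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Index each categorical string column into a numeric index column.
val eduIndexer = new StringIndexer()
  .setInputCol("education").setOutputCol("educationIndex")
val sexIndexer = new StringIndexer()
  .setInputCol("sex").setOutputCol("sexIndex")
// The label must also be numeric; the classifier reads "label" by default.
val labelIndexer = new StringIndexer()
  .setInputCol("salaryRange").setOutputCol("label")

// Assemble all feature columns into the single "features" vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hours_per_week", "educationIndex", "sexIndex"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
val pipeline = new Pipeline()
  .setStages(Array(eduIndexer, sexIndexer, labelIndexer, assembler, rf))

// Wrap the whole pipeline in a cross validator over a small grid.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(10, 50))
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())  // labels here are binary (A/B)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val model = cv.fit(dataFixed)
```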