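I am trying to run random forest classification with the Spark ML API, but I am having trouble creating the right DataFrame input for the pipeline.
Here is the sample data: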
age,hours_per_week,education,sex,salaryRange
38,40,"hs-grad","male","A"
28,40,"bachelors","female","A"
52,45,"hs-grad","male","B"
31,50,"masters","female","B"
42,40,"bachelors","male","B"
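age and hours_per_week are integers, while the other features, including the label salaryRange, are categorical (String).
Loading this CSV file (let's call it sample.csv) can be done with the Spark CSV library as follows: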
import com.databricks.spark.csv._
val data = sqlContext.csvFile("/home/dusan/sample.csv")
By default, all columns are imported as strings, so we need to change "age" and "hours_per_week" to Int:
val toInt = udf[Int, String]( _.toInt)
val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week",toInt(data("hours_per_week")))
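As an aside, the same conversion can probably be done without a UDF by using the built-in Column.cast. This is only a minimal sketch: it assumes the `data` DataFrame loaded above, and `dataCast` is a hypothetical name that does not appear in the question.

```scala
import org.apache.spark.sql.types.IntegerType

// Assumes `data` is the DataFrame loaded from sample.csv above,
// with every column read as a string.
// `dataCast` is a hypothetical name used only for this sketch.
val dataCast = data
  .withColumn("age", data("age").cast(IntegerType))
  .withColumn("hours_per_week", data("hours_per_week").cast(IntegerType))
```

One difference to keep in mind: with the default settings, cast should yield null for a value that cannot be parsed as an integer, whereas the `_.toInt` UDF would throw an exception.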
Just to check what the schema looks like now:
scala> dataFixed.printSchema
root
|-- age: integer (nullable = true)
|-- hours_per_week: integer (nullable = true)
|-- education: string (nullable = true)
|-- sex: string (nullable = true)
|-- salaryRange: string (nullable = true)
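Since education, sex and salaryRange are still strings in this schema, they would presumably need to be turned into numeric columns before a classifier can use them. Below is a minimal sketch of one way to do that, assuming Spark 1.4 or later, where StringIndexer and VectorAssembler are available. The output column names ("educationIndex", "sexIndex", "label", "features") and the value names are hypothetical.

```scala
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Map each string column to a numeric index column.
// The output column names are hypothetical choices for this sketch.
val educationIndexer = new StringIndexer()
  .setInputCol("education")
  .setOutputCol("educationIndex")
val sexIndexer = new StringIndexer()
  .setInputCol("sex")
  .setOutputCol("sexIndex")
val labelIndexer = new StringIndexer()
  .setInputCol("salaryRange")
  .setOutputCol("label")

// Combine the numeric and indexed columns into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hours_per_week", "educationIndex", "sexIndex"))
  .setOutputCol("features")
```

These four stages would go in front of the classifier in a pipeline's stage array, so that the classifier sees a numeric "label" column and a vector "features" column.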
Then set up the cross validator and the pipeline:
val rf = new RandomForestClassifier()
val pipeline …

I am trying to find the accuracy of a random forest classifier model in Scala using 5-fold cross-validation, but I get the following error at runtime:
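java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
The error above is raised at the line ---> val cvModel = cv.fit(trainingData)
The code I use to cross-validate the data set with random forest is as follows: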
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val data = sc.textFile("exprogram/dataset.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(41).toDouble,
    Vectors.dense(parts(0).split(',').map(_.toDouble)))
}
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)
val trainingData = training.toDF()
val testData = test.toDF()
val nFolds: Int = 5
val NumTrees: Int = 5
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features") …
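
The error message points at StringIndexer: the "label" column produced by toDF() is a plain double column with no metadata about the number of classes. A minimal sketch of one possible fix follows, assuming the `trainingData` DataFrame built above. The names `cvLabelIndexer`, `rfIndexed`, `indexedPipeline` and the column "indexedLabel" are hypothetical and not part of the original code.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer

// Assumes `trainingData` is the DataFrame built above,
// with "label" and "features" columns.
// "indexedLabel" is a hypothetical column name.
val cvLabelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

// Point the classifier at the indexed label, which carries the class metadata.
val rfIndexed = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(5) // same value as NumTrees above

// Put the indexer in front of the classifier so both run as one estimator.
val indexedPipeline = new Pipeline()
  .setStages(Array(cvLabelIndexer, rfIndexed))
```

If this pipeline is passed to the CrossValidator as its estimator, the evaluator's label column would need to be set to "indexedLabel" as well. Note that StringIndexer assigns indices by label frequency, so the indexed values may not match the original label values.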