Apache Spark: StackOverflowError when trying to index string columns

I have a CSV file with about 5000 rows and 950 columns. First I load it into a DataFrame:

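// csvFormat and file are defined elsewhere; the post does not show them.
val data = sqlContext.read
  .format(csvFormat)
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // let the reader infer column types
  .load(file)
  .cache()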

After that I collect the names of all string columns

import org.apache.spark.sql.types.StringType
// Names of all columns whose inferred type is String
val featuresToIndex = data.schema
  .filter(_.dataType == StringType)
  .map(field => field.name)

and want to index them. To do that, I create an indexer for each string column

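import org.apache.spark.ml.feature.StringIndexer
// One StringIndexer per string column, writing to "<name>Indexed"
val stringIndexers = featuresToIndex.map(colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "Indexed"))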

and build a pipeline from them

import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(stringIndexers.toArray)

But when I try to transform my initial DataFrame with this pipeline

val indexedDf = pipeline.fit(data).transform(data)

I get a StackOverflowError

16/07/05 16:55:12 INFO DAGScheduler: Job 4 finished: countByValue at StringIndexer.scala:86, took 7.882774 s
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.immutable.Set$Set1.contains(Set.scala:84)
at scala.collection.immutable.Set$Set1.$plus(Set.scala:86)
at scala.collection.immutable.Set$Set1.$plus(Set.scala:81)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:22)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:20)
at scala.collection.generic.Growable$class.loop$1(Growable.scala:53)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
at …
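
For readers hitting the same error, here is a minimal sketch of one possible workaround. It assumes the overflow comes from the deeply nested query plan built up when hundreds of indexers are chained in a single pipeline, which the post itself does not confirm. The helper name `indexInBatches` and the `batchSize` parameter are hypothetical and not part of the original code. The idea is to fit and apply the indexers a few at a time and rebuild the DataFrame from its RDD after each batch.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper, not from the post: fits and applies the indexers in
// small batches instead of one pipeline with hundreds of stages.
def indexInBatches(
    sqlContext: SQLContext,
    df: DataFrame,
    columns: Seq[String],
    batchSize: Int = 50): DataFrame =
  columns.grouped(batchSize).foldLeft(df) { (current, batch) =>
    // Same indexer setup as in the question, limited to this batch of columns.
    val stages = batch.map { colName =>
      new StringIndexer()
        .setInputCol(colName)
        .setOutputCol(colName + "Indexed")
    }
    val pipeline = new Pipeline().setStages(stages.toArray)
    val transformed = pipeline.fit(current).transform(current)
    // Rebuild the DataFrame from its RDD and schema so the next batch starts
    // from a fresh, shallow plan (assumption: plan depth is the problem).
    sqlContext.createDataFrame(transformed.rdd, transformed.schema)
  }
```

With the names from the question, this would be called as `indexInBatches(sqlContext, data, featuresToIndex)`. Smaller batches keep each plan shallower at the cost of more jobs, so the batch size of 50 is only an illustrative default.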

java scala apache-spark apache-spark-mllib

And*_*bin

2016-07-05

18 votes
2 answers
3875 views