Apache Spark Gradient Boosted Tree training runs slowly

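I am experimenting with the Gradient Boosted Trees learning algorithm from the ML library of Spark 1.4. I am solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of the resulting GBT ensemble in a human-readable format. My experience so far is that, for my problem size, adding more resources to the cluster does not seem to affect the length of the run. A 10-iteration training run takes roughly 13 hours. This is unacceptable, since I am looking to do 100-300 iteration runs, and the execution time seems to explode with the number of iterations.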
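To put the numbers above in perspective, here is a minimal back-of-the-envelope sketch. It assumes the cost per boosting iteration stays constant, which is optimistic given that the run time appears to grow faster than linearly, so the results should be read as a lower bound. The class and method names are hypothetical and exist only for this illustration.

```java
public class GbtRuntimeEstimate {

    /**
     * Extrapolates training time linearly from one observed run.
     * Assumption: every boosting iteration costs the same, which is
     * likely a best case for the behaviour described in the question.
     */
    static double linearHours(double observedHours, int observedIterations, int targetIterations) {
        if (observedIterations <= 0) {
            throw new IllegalArgumentException("observedIterations must be positive");
        }
        return observedHours * targetIterations / observedIterations;
    }

    public static void main(String[] args) {
        double observedHours = 13.0;   // from the 10-iteration run described above
        int observedIterations = 10;

        for (int target : new int[] {100, 300}) {
            double hours = linearHours(observedHours, observedIterations, target);
            System.out.printf("%d iterations: at least %.0f hours (about %.1f days)%n",
                    target, hours, hours / 24.0);
        }
    }
}
```

Even under this optimistic assumption, the 100-300 iteration runs I am aiming for would take on the order of days to weeks.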

My Spark application

This is not the exact code, but it can be reduced to:

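import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.configuration.BoostingStrategy;
import org.apache.spark.mllib.util.MLUtils;

SparkConf sc = new SparkConf().setAppName("GBT Trainer")
            // unlimited max result size for intermediate Map-Reduce ops.
            // Having no limit is probably bad, but I've not had time to find
            // a tighter upper bound and the default value wasn't sufficient.
            .set("spark.driver.maxResultSize", "0");
JavaSparkContext jsc = new JavaSparkContext(sc);

// The input file is encoded in plain-text LIBSVM format ~59GB in size
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD();

// Binary classification with depth-1 trees (decision stumps), 10 boosting iterations
BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Classification");
boostingStrategy.setNumIterations(10);
boostingStrategy.getTreeStrategy().setNumClasses(2);
boostingStrategy.getTreeStrategy().setMaxDepth(1); …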

machine-learning amazon-web-services elastic-map-reduce apache-spark

Vla*_*nko

2015-09-22

12 votes, 1 answer, 1084 views