Tag: apache-spark-1.2

Cluster hangs in 'ssh-ready' state using Spark 1.2.0 EC2 launch script

I am trying to launch a standalone Spark cluster using the prepackaged EC2 scripts, but it just hangs indefinitely in the 'ssh-ready' state:

ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k <key-pair> -i <identity-file>.pem -r us-west-2 -s 3 launch test
Setting up security groups...
Searching for existing cluster test...
Spark AMI: ami-ae6e0d9e
Launching instances...
Launched 3 slaves in us-west-2c, regid = r-b_______6
Launched master in us-west-2c, regid = r-0______0
Waiting for all instances in cluster to enter 'ssh-ready' state..........

However, I can SSH in without any complaints:

ubuntu@machine:~$ ssh -i <identity-file>.pem root@master-ip
Last login: Day MMM DD HH:mm:ss 20YY from c-AA-BBB-CCCC-DDD.eee1.ff.provider.net

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
There are …

amazon-ec2 amazon-web-services apache-spark apache-spark-1.2

nmu*_*thy

2015 08-24

5
votes

1
answer

3243
views

How to encode categorical features in Apache Spark

I have a set of data from which I want to build a classification model. Each row has the following form:

user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1

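There are about 1M users, 2 classes, and 1M products. What I want to do next is create sparse vectors (something MLlib already supports), but in order to apply that function I first have to create dense vectors (filled with 0s). In other words, I have to binarize my data. What is the easiest (or most elegant) way to do that?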

Given that I am new to MLlib, could you provide a concrete example? I am using MLlib 1.2.

EDIT

I ended up with the following piece of code, but it turns out to be really slow... Any other ideas, given that I can only use MLlib 1.2?

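import org.apache.spark.mllib.linalg.Vectors

// Assumed from usage: test11 is an RDD of rows already split into
// (user, class, product) fields; test12 is a local collection of all products.
val data = test11
  .map(x => ((x(0), x(1)), x(2))) // key each row by (user, class)
  .groupByKey()
  .map { case ((id, cl), products) =>
    val dt = products.toArray
    // Dense 0/1 indicator over every product in test12.
    // Each lookup scans dt linearly, which is what makes this slow.
    val lt = test12.iterator.map(y => if (dt.contains(y)) 1.0 else 0.0).toArray
    (id, cl, Vectors.dense(lt))
  }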
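
The slowness comes from building a full-width dense array for every user. One possible alternative, shown here only as a minimal sketch in plain Scala with no Spark dependency, is to assign each product a column index once and keep just the indices of the products a user actually has. The names `SparseEncodingSketch`, `buildIndex`, and `encode` are hypothetical, and the sample rows are the ones from the question. The resulting `(size, indices, values)` triple has the shape that MLlib's `Vectors.sparse(size, indices, values)` accepts.

```scala
object SparseEncodingSketch {
  // Assign each distinct product a column index (sorted so the mapping is deterministic).
  def buildIndex(products: Iterable[String]): Map[String, Int] =
    products.toSeq.distinct.sorted.zipWithIndex.toMap

  // Encode one user's products as (vector size, sorted column indices, values of 1.0).
  // Products missing from the index are silently skipped.
  def encode(index: Map[String, Int],
             bought: Iterable[String]): (Int, Array[Int], Array[Double]) = {
    val indices = bought.flatMap(index.get).toArray.distinct.sorted
    (index.size, indices, Array.fill(indices.length)(1.0))
  }

  def main(args: Array[String]): Unit = {
    // Sample rows taken from the question: (user, class, product).
    val rows = Seq(
      ("user1", "class1", "product1"),
      ("user1", "class1", "product2"),
      ("user1", "class1", "product5"),
      ("user2", "class1", "product2"),
      ("user2", "class1", "product5"),
      ("user3", "class2", "product1")
    )

    val index = buildIndex(rows.map(_._3))

    // Group by (user, class) and encode each group's products.
    val encoded = rows.groupBy(r => (r._1, r._2)).map { case ((user, cls), rs) =>
      (user, cls, encode(index, rs.map(_._3)))
    }

    encoded.toSeq.sortBy(_._1).foreach { case (user, cls, (size, idx, vals)) =>
      println(s"$user $cls size=$size indices=${idx.mkString(",")} values=${vals.mkString(",")}")
    }
  }
}
```

In a Spark job, the index map would presumably be built once and shared with the workers (for example via a broadcast variable), with `encode` applied to each group after `groupByKey`. That wiring is an assumption and is left out of the sketch.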

scala apache-spark apache-spark-1.2 apache-spark-mllib

use*_*838

2016 04-25

5
votes

1
answer

8452
views