SparkContext中的parallelize()方法

Raj*_*Raj 15 apache-spark

我想了解的给予不同的影响numSlicesparallelize()在方法SparkContext.下面给出的是Syntax该方法

def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
(implicit arg0: ClassTag[T]): RDD[T]
Run Code Online (Sandbox Code Playgroud)

我在本地模式下运行spark-shell

spark-shell --master local
Run Code Online (Sandbox Code Playgroud)

我的理解是,numSlices决定结果RDD的分区号(在调用之后sc.parallelize()).考虑下面的几个例子

情况1

scala> sc.parallelize(1 to 9, 1);
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> res0.partitions.size
res2: Int = 1
Run Code Online (Sandbox Code Playgroud)

案例2

scala> sc.parallelize(1 to 9, 2);
res3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> res3.partitions.size
res4: Int = 2
Run Code Online (Sandbox Code Playgroud)

案例3

scala> sc.parallelize(1 to 9, 3);
res5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
scala> res3.partitions.size
res6: Int = 2
Run Code Online (Sandbox Code Playgroud)

案例4

scala> sc.parallelize(1 to 9, 4);
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:22
scala> res3.partitions.size
res8: Int = 2
Run Code Online (Sandbox Code Playgroud)

问题1:在案例3案例4中,我期望分区大小分别为3&4,但两种情况都只有分区大小2.这是什么原因?

问题2:在每种情况下都有一个与之相关的数字ParallelCollectionRDD[no].即在案例1中ParallelCollectionRDD[0],在案例2中它是ParallelCollectionRDD[1]等等.这些数字到底意味着什么?

Mat*_*ves 22

问题1:这是你的错字.你打电话res3.partitions.size,而不是res5res7分别.当我用正确的数字做它时,它按预期工作.

问题2:这是Spark上下文中RDD的id ,用于保持图表的直线.看看当我运行三次相同命令时会发生什么:

scala> sc.parallelize(1 to 9,1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22

scala> sc.parallelize(1 to 9,1)
res1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22

scala> sc.parallelize(1 to 9,1)
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
Run Code Online (Sandbox Code Playgroud)

现在有三种不同的RDD具有三种不同的ID.我们可以运行以下内容来检查:

scala> (res0.id, res1.id, res2.id)
res3: (Int, Int, Int) = (0,1,2)
Run Code Online (Sandbox Code Playgroud)

  • 谢谢@Matthew Graves的答案.我这么愚蠢的做出那个错字:). (6认同)