我想了解的给予不同的影响numSlices到parallelize()在方法SparkContext.下面给出的是Syntax该方法
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
(implicit arg0: ClassTag[T]): RDD[T]
Run Code Online (Sandbox Code Playgroud)
我在本地模式下运行spark-shell
spark-shell --master local
Run Code Online (Sandbox Code Playgroud)
我的理解是,numSlices决定结果RDD的分区号(在调用之后sc.parallelize()).考虑下面的几个例子
情况1
scala> sc.parallelize(1 to 9, 1);
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> res0.partitions.size
res2: Int = 1
Run Code Online (Sandbox Code Playgroud)
案例2
scala> sc.parallelize(1 to 9, 2);
res3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> res3.partitions.size
res4: Int = 2
Run Code Online (Sandbox Code Playgroud)
案例3
scala> sc.parallelize(1 to 9, 3);
res5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
scala> res3.partitions.size
res6: Int = 2
Run Code Online (Sandbox Code Playgroud)
案例4
scala> sc.parallelize(1 to 9, 4);
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:22
scala> res3.partitions.size
res8: Int = 2
Run Code Online (Sandbox Code Playgroud)
问题1:在案例3和案例4中,我期望分区大小分别为3&4,但两种情况都只有分区大小2.这是什么原因?
问题2:在每种情况下都有一个与之相关的数字ParallelCollectionRDD[no].即在案例1中ParallelCollectionRDD[0],在案例2中它是ParallelCollectionRDD[1]等等.这些数字到底意味着什么?
Mat*_*ves 22
问题1:这是你的错字.你打电话res3.partitions.size,而不是res5和res7分别.当我用正确的数字做它时,它按预期工作.
问题2:这是Spark上下文中RDD的id ,用于保持图表的直线.看看当我运行三次相同命令时会发生什么:
scala> sc.parallelize(1 to 9,1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> sc.parallelize(1 to 9,1)
res1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> sc.parallelize(1 to 9,1)
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
Run Code Online (Sandbox Code Playgroud)
现在有三种不同的RDD具有三种不同的ID.我们可以运行以下内容来检查:
scala> (res0.id, res1.id, res2.id)
res3: (Int, Int, Int) = (0,1,2)
Run Code Online (Sandbox Code Playgroud)