Finding the size of data stored in an RDD created from a text file in Apache Spark

bob*_*bob 4 scala apache-spark apache-spark-1.4

I am new to Apache Spark (version 1.4.1). I wrote a small piece of code to read a text file and store its data in an RDD.

Is there a way to get the size of the data in an RDD?

Here is my code:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator
import org.apache.spark.sql.Row

object RddSize {

  def main(args: Array[String]) {

    val sc = new SparkContext("local", "data size")
    val FILE_LOCATION = "src/main/resources/employees.csv"
    val peopleRdd = sc.textFile(FILE_LOCATION)

    val newRdd = peopleRdd.filter(str => str.contains(",M,"))
    // Here I want to find the size of the remaining data
  }
} 

I want to get the data size both before the filter transformation (peopleRdd) and after it (newRdd).

Ami*_*bey 8

There are several ways to get the size of an RDD:

1. Add a Spark listener to your Spark context

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Register the listener on your SparkContext (sc in your code) before running jobs
sc.addSparkListener(new SparkListener() {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
    stageCompleted.stageInfo.rddInfos.foreach { info =>
      println("rdd memSize  " + info.memSize)
      println("rdd diskSize " + info.diskSize)
    }
  }
})

2. Save the RDD as a text file:

myRDD.saveAsTextFile("person.txt")

and then query the Apache Spark REST API:

/applications/[app-id]/stages
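For example, the stages endpoint can be read like any other HTTP resource. This is only a sketch: it assumes the application's UI is still running on the default port 4040 on localhost, and the application id shown is hypothetical (take the real one from the Spark UI or from `sc.applicationId`).

```scala
import scala.io.Source

// Hypothetical application id: substitute the real one for your app
val appId = "app-20150101000000-0000"
val url = s"http://localhost:4040/api/v1/applications/$appId/stages"

// The fetch only works while the application (and its web UI) is running:
// val stagesJson = Source.fromURL(url).mkString
println(url)
```

The JSON returned for each stage includes input/output byte counters, which is where the per-stage data sizes come from.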

3. You can also try SizeEstimator. Note that it estimates the in-memory size of whatever object you pass it on the driver, so for an RDD whose data has not been brought to the driver the result may reflect the RDD's own object graph rather than the distributed data:

val rddSize = SizeEstimator.estimate(myRDD)
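If only a rough number is needed, one further alternative (my sketch, not part of the answer above) is to sum the UTF-8 byte length of every line before and after the filter. On a real RDD this is just `rdd.map(...).reduce(_ + _)`; here a plain `Seq` with made-up employee rows stands in for `peopleRdd` so the idea runs without Spark:

```scala
// Made-up rows standing in for the lines of employees.csv
val lines = Seq("1,Alice,F,HR", "2,Bob,M,Sales", "3,Carol,F,IT")

// Approximate data size = sum of each line's UTF-8 byte length
val totalBytes    = lines.map(_.getBytes("UTF-8").length).sum
val filteredBytes = lines.filter(_.contains(",M,"))
                         .map(_.getBytes("UTF-8").length).sum

println(s"before filter: $totalBytes bytes, after filter: $filteredBytes bytes")
// prints: before filter: 37 bytes, after filter: 13 bytes
```

This ignores per-object JVM overhead (which SizeEstimator tries to account for), but it directly answers "how much raw text survives the filter".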