bob*_*bob 4 scala apache-spark apache-spark-1.4
I am new to Apache Spark (version 1.4.1). I wrote a small piece of code that reads a text file and stores its data in an RDD.
Is there a way to get the size of the data in an RDD?
Here is my code:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator
import org.apache.spark.sql.Row

object RddSize {

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "data size")
    val FILE_LOCATION = "src/main/resources/employees.csv"
    val peopleRdd = sc.textFile(FILE_LOCATION)
    val newRdd = peopleRdd.filter(str => str.contains(",M,"))
    // Here I want to find the size of the remaining data
  }
}
I want to get the size of the data before the filter transformation (peopleRdd) and after it (newRdd).
There are several ways to get the size of an RDD.

1. Add a Spark listener to your Spark context:
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

sc.addSparkListener(new SparkListener() {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
    // rddInfos lists the RDDs involved in the completed stage,
    // with their in-memory and on-disk storage sizes
    stageCompleted.stageInfo.rddInfos.foreach(info => {
      println("rdd memSize " + info.memSize)
      println("rdd diskSize " + info.diskSize)
    })
  }
})
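memSize and diskSize come from Spark's storage layer, so they should only be non-zero for RDDs that have been persisted. A minimal sketch of driving the listener with the question's RDDs (assuming the listener above was registered on sc before the action runs):

// Persist both RDDs so completed stages report their storage sizes,
// then trigger a job with an action so onStageCompleted fires.
peopleRdd.cache()
newRdd.cache()
newRdd.count()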
2. Save the RDD as a text file and check the size of the output on disk:

myRDD.saveAsTextFile("person.txt")

You can also query the Apache Spark REST API for stage-level metrics:

/applications/[app-id]/stages
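For example, a minimal sketch of such a query, assuming the application is still running locally with the Spark UI on its default port 4040 (the REST API ships with Spark 1.4):

import scala.io.Source

// The driver serves the REST API under /api/v1 on the UI port.
val appId = sc.applicationId
val url = s"http://localhost:4040/api/v1/applications/$appId/stages"
val stagesJson = Source.fromURL(url).mkString
// Raw JSON with per-stage metrics such as inputBytes and outputBytes
println(stagesJson)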
3. You can also try SizeEstimator:

val rddSize = SizeEstimator.estimate(myRDD)
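Note that SizeEstimator.estimate walks the object graph of the driver-side RDD object rather than the distributed data, so treat the result as a rough estimate. For the string RDDs in the question, one alternative is to sum the byte length of every line; a minimal sketch (the approxBytes helper is illustrative, not part of any Spark API):

import org.apache.spark.rdd.RDD

// A rough data-size estimate for an RDD of lines: sum the UTF-8 byte
// length of every element. fold(0L) keeps this safe on an empty RDD.
def approxBytes(rdd: RDD[String]): Long =
  rdd.map(_.getBytes("UTF-8").length.toLong).fold(0L)(_ + _)

println("before filter: " + approxBytes(peopleRdd) + " bytes")
println("after filter: " + approxBytes(newRdd) + " bytes")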