小编ni_*_*ama的帖子

Spark on Cluster：读取大量小 avro 文件需要很长时间才能列出

我知道在 HDFS 中读取大量小文件的问题一直是一个问题并被广泛讨论，但请耐心等待。处理此类问题的大多数 stackoverflow 问题都与读取大量 txt 文件有关。我正在尝试读取大量小 avro 文件

另外，这些读取txt文件解决方案讨论使用WholeTextFileInputFormat或CombineInputFormat（/sf/answers/3072911341/）它们是RDD实现，我使用Spark 2.4（HDFS 3.0.0）并且通常不鼓励使用RDD实现和数据框是首选。我更喜欢使用数据帧，但也对 RDD 实现持开放态度。

我已经按照 Murtaza 的建议尝试合并数据帧，但在大量文件上出现 OOM 错误（/sf/answers/2248236301/）

我正在使用以下代码

val filePaths = avroConsolidator.getFilesInDateRangeWithExtension //pattern:filePaths: Array[String] 
//I do need to create a list of file paths as I need to filter files based on file names. Need this logic for some upstream process
//example : Array("hdfs://server123:8020/source/Avro/weblog/2019/06/03/20190603_1530.avro","hdfs://server123:8020/source/Avro/weblog/2019/06/03/20190603_1531.avro","hdfs://server123:8020/source/Avro/weblog/2019/06/03/20190603_1532.avro")
val df_mid = sc.read.format("com.databricks.spark.avro").load(filePaths: _*)
      val df = df_mid
        .withColumn("dt", date_format(df_mid.col("timeStamp"), "yyyy-MM-dd"))
        .filter("dt != 'null'")

      df
        .repartition(partitionColumns(inputs.logSubType).map(new org.apache.spark.sql.Column(_)):_*)
        .write.partitionBy(partitionColumns(inputs.logSubType): _*) …

Run Code Online (Sandbox Code Playgroud)

scala hdfs apache-spark spark-avro

ni_*_*ama

2019 07-10

5
推荐指数

1
解决办法

1902
查看次数