Getting the file name in a Spark file stream

RaA*_*aAm 5 scala filestream apache-spark spark-streaming

I need to know the names of the input files that are being streamed from the input directory.

Below is my Spark file streaming code, written in Scala:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

object FileStreamExample {
  def main(args: Array[String]): Unit = {

    val sparkSession = SparkSession.builder.master("local").getOrCreate()

    val input_dir = "src/main/resources/stream_input"
    val ck = "src/main/resources/chkpoint_dir"

    // create stream from folder
    val fileStreamDf = sparkSession.readStream.csv(input_dir)

    // print the input files backing the DataFrame
    def fileNames(): Unit = fileStreamDf.inputFiles.foreach(println(_))

    println("Streaming Started...\n")
    // fileNames() // calling it here throws the same exception
    val query = fileStreamDf.writeStream
      .format("console")
      .outputMode(OutputMode.Append())
      .option("checkpointLocation", ck)
      .start()

    fileNames() // throws the AnalysisException shown below

    query.awaitTermination()
  }
}

But I get the following exception while streaming:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[src/main/resources/stream_input]

Ish*_*han 8

You can use the input_file_name() function, defined in org.apache.spark.sql.functions._, to get the name of the file from which each row was read into the DataFrame.

sparkSession.readStream.csv(input_dir).withColumn("FileName", input_file_name())
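
For context, here is a minimal sketch of how the one-liner above might fit into the program from the question. As far as I can tell, the exception comes from calling inputFiles on a streaming DataFrame, which tries to evaluate the plan as a batch query, so the sketch drops that call and relies on input_file_name() instead. The two-column schema (id, value) and the helper withFileName are hypothetical names made up for illustration. An explicit schema is supplied because streaming file sources normally require one unless spark.sql.streaming.schemaInference is enabled.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object FileNameStreamExample {

  // Hypothetical schema (assumption): replace with the real layout of your CSV files.
  val schema: StructType = StructType(Seq(
    StructField("id", StringType),
    StructField("value", StringType)
  ))

  // Adds a "FileName" column holding the path of the file each row was read from.
  def withFileName(df: DataFrame): DataFrame =
    df.withColumn("FileName", input_file_name())

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("FileNameStreamExample")
      .getOrCreate()

    // Same directories as in the question.
    val inputDir = "src/main/resources/stream_input"
    val checkpointDir = "src/main/resources/chkpoint_dir"

    // Build the streaming DataFrame and tag every row with its source file.
    val fileStreamDf = withFileName(spark.readStream.schema(schema).csv(inputDir))

    val query = fileStreamDf.writeStream
      .format("console")
      .option("truncate", "false") // show the full file path in the console output
      .outputMode(OutputMode.Append())
      .option("checkpointLocation", checkpointDir)
      .start()

    query.awaitTermination()
  }
}
```

Each row printed by the console sink should then carry the full path of its source file in the FileName column. If only the last path segment is needed, a string function such as substring_index can be applied to that column.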