Is there an explanation for when spark-csv won't save a DataFrame to a file?

Question

dataFrame.coalesce(1).write().save("path") sometimes writes only the _SUCCESS and ._SUCCESS.crc files, without the expected *.csv.gz file, even when the input DataFrame is not empty.

File saving code:

private static void writeCsvToDirectory(Dataset<Row> dataFrame, Path directory) {
    dataFrame.coalesce(1)
            .write()
            .format("csv")
            .option("header", "true")
            .option("delimiter", "\t")
            .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
            .mode(SaveMode.Overwrite)
            .save("file:///" + directory);
}

File retrieval code:

static Path getTemporaryCsvFile(Path directory) throws IOException {
    String glob = "*.csv.gz";
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(directory, glob)) {
        return stream.iterator().next();
    } catch (NoSuchElementException e) {
        throw new RuntimeException(getNoSuchElementExceptionMessage(directory, glob), e);
    }
}

Example of the file retrieval error:

java.lang.RuntimeException: directory /tmp/temp5889805853850415940 does not contain a file with glob *.csv.gz. Directory listing:
    /tmp/temp5889805853850415940/_SUCCESS, 
    /tmp/temp5889805853850415940/._SUCCESS.crc

My code relies on that file being there. Can someone explain why this happens?

Answer 1

Art*_*sia 5

"The output file should (and logically must) contain at least the header line and some data lines. But it doesn't exist at all."

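This comment is a bit misleading. According to the code on GitHub, this happens only when the DataFrame is empty, and in that case no SUCCESS files are produced. Given that those files are present, the DataFrame is not empty and the write in your writeCsvToDirectory code was triggered.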

I have a couple of questions:

Did your Spark job finish without errors?
Is the timestamp of the SUCCESS file updated?
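
The second question can be checked programmatically. The following is a minimal sketch, not part of Spark: the class name SuccessMarkerCheck and the method isSuccessMarkerFresh are hypothetical, and it assumes the output lives on the local file system, as in the question. It compares the last-modified time of _SUCCESS with an instant recorded just before the write. The start instant is truncated to whole seconds on the assumption that some file systems store coarser timestamps.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper (not a Spark API): tells whether _SUCCESS belongs to the current run. */
final class SuccessMarkerCheck {

    private SuccessMarkerCheck() {}

    /**
     * Returns true if directory/_SUCCESS exists and was last modified
     * at or after jobStart (compared at whole-second precision).
     */
    static boolean isSuccessMarkerFresh(Path directory, Instant jobStart) throws IOException {
        Path marker = directory.resolve("_SUCCESS");
        if (!Files.isRegularFile(marker)) {
            return false; // no marker at all, so nothing was committed here
        }
        Instant modified = Files.getLastModifiedTime(marker).toInstant();
        // Assumption: truncate to seconds to tolerate coarse file-system timestamps.
        Instant threshold = jobStart.truncatedTo(ChronoUnit.SECONDS);
        return !modified.isBefore(threshold);
    }
}
```

A possible use is to record Instant.now() right before calling writeCsvToDirectory and call isSuccessMarkerFresh(directory, thatInstant) right after it. A false result would suggest the marker was left by an earlier run.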

My two main suspects are:

coalesce(1) - if you have a lot of data, this may fail
SaveMode.Overwrite - I have a feeling that those SUCCESS files were left in that folder by a previous run
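
One way to test the second suspicion is to empty the target directory before the write, so that any _SUCCESS file found afterwards must come from the current run. The sketch below is only an illustration under that assumption: OutputDirectoryCleaner and clearDirectory are hypothetical names, not Spark APIs, and the code uses only java.nio on a local directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical helper (not a Spark API): empties a local output directory before a write. */
final class OutputDirectoryCleaner {

    private OutputDirectoryCleaner() {}

    /** Deletes everything inside the directory but keeps the directory itself. */
    static void clearDirectory(Path directory) throws IOException {
        if (!Files.isDirectory(directory)) {
            return; // nothing to clean
        }
        try (Stream<Path> paths = Files.walk(directory)) {
            // Reverse order visits children before their parent directories.
            paths.sorted(Comparator.reverseOrder())
                 .filter(path -> !path.equals(directory))
                 .forEach(path -> {
                     try {
                         Files.delete(path);
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }
}
```

If the directory is still empty after calling clearDirectory(directory) and then writeCsvToDirectory(dataFrame, directory), the write most likely did not run or went to a different location. If _SUCCESS reappears without a data file, stale markers can be ruled out.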
