How can I improve the performance of TextIO or AvroIO when reading a very large number of files?

Question

jkf*_*kff 5 google-cloud-dataflow apache-beam apache-beam-io

TextIO.read() and AvroIO.read() (as well as some other Beam IOs) do not perform very well by default in current Apache Beam runners when reading a filepattern that expands into a very large number of files, for example 1M files.

How can I read such a large number of files efficiently?

Answer 1

jkf*_*kff 4

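When you know in advance that the filepattern being read with TextIO or AvroIO will expand into a large number of files, you can use the recently added .withHintMatchesManyFiles() feature, which is currently implemented on TextIO and AvroIO.

For example: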

PCollection<String> lines = p.apply(TextIO.read()
    .from("gs://some-bucket/many/files/*")
    .withHintMatchesManyFiles());


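Using this hint causes the transforms to execute in a way that is optimized for reading a large number of files: the number of files that can be read in this case is practically unlimited, and the pipeline will most likely run faster, cheaper and more reliably than it would without the hint.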

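However, if the filepattern actually matches only a small number of files (for example, a few dozen or a few hundred), it may perform worse than it would without the hint.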

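Under the hood, this hint causes the transforms to execute via TextIO.readAll() or AvroIO.readAll() respectively. These are more flexible and scalable versions of read() that allow reading a PCollection<String> of filepatterns (where each String is a filepattern), but the same caveat applies: if the total number of files matching the filepatterns is small, they may perform worse than a simple read() with the filepattern specified at pipeline construction time.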
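To make the readAll() form concrete, here is a minimal sketch of passing the filepatterns in as a PCollection<String>. The class name ReadAllSketch, the filepatterns() helper and the second bucket path are made up for illustration, and the sketch assumes a Beam Java SDK release that still ships TextIO.readAll() (later releases deprecate it in favor of FileIO matching followed by TextIO.readFiles()).

```java
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class ReadAllSketch {

  // Hypothetical filepatterns, for illustration only.
  static List<String> filepatterns() {
    return Arrays.asList(
        "gs://some-bucket/many/files/*",
        "gs://some-bucket/more/files/*");
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Here the filepatterns are ordinary pipeline data rather than
    // construction-time configuration; each String is one filepattern.
    PCollection<String> patterns = p.apply(Create.of(filepatterns()));

    // readAll() expands every filepattern and reads all matching files.
    PCollection<String> lines = patterns.apply(TextIO.readAll());

    p.run().waitUntilFinish();
  }
}
```

Running this against gs:// paths also assumes the Google Cloud Platform filesystem module is on the classpath and that credentials are configured. In a real pipeline the patterns would more likely come from an upstream transform than from Create.of(), which is where readAll() offers something the hint alone does not.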
