PySpark: read multiple XML files (a list of s3 paths) into a Spark DataFrame

Question

her*_*arn 5 apache-spark pyspark databricks

As the title says, I have a list of s3 paths:

s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

I am using PySpark and would like to know how to load all of these XML files into a single DataFrame, along the lines of the example below.

df = spark.read.format("com.databricks.spark.xml").option("rowTag", "head").load(s3_paths)

I can read a single file, but I am looking for the best way to load all of them.

Answer 1

Shu*_*ain 0

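Unpacking the list with * does not work here: DataFrameReader.load() takes the path (a string or a list of strings) as its first argument and the format as its second, so load(*s3_paths) would pass the second path as the format. With spark-xml, a single string of comma-separated paths should work, so join the list instead:

s3_paths = ["s3a://somebucket/1/file1.xml", "s3a://somebucket/3/file2.xml"]

df = spark.read.format("com.databricks.spark.xml").option("rowTag", "head").load(",".join(s3_paths))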
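
A slightly more defensive variant can be sketched as follows. It assumes that spark-xml accepts several inputs as one comma-separated path string, which is how it is commonly used but is worth confirming against the version you have installed. `join_paths` and `read_xml_files` are hypothetical helper names, not part of PySpark or spark-xml, and `read_xml_files` needs a live SparkSession with the spark-xml package available.

```python
from typing import Iterable


def join_paths(paths: Iterable[str]) -> str:
    """Join input paths into one comma-separated string (hypothetical helper)."""
    # Drop empty entries and surrounding whitespace.
    cleaned = [p.strip() for p in paths if p and p.strip()]
    if not cleaned:
        raise ValueError("no input paths given")
    # A comma inside a path would be read as a separator, so reject it early.
    bad = [p for p in cleaned if "," in p]
    if bad:
        raise ValueError(f"paths must not contain commas: {bad}")
    return ",".join(cleaned)


def read_xml_files(spark, paths, row_tag="head"):
    """Load every XML file in `paths` into one DataFrame.

    Assumption: the spark-xml reader treats a comma-separated string
    as multiple input paths.
    """
    return (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", row_tag)
        .load(join_paths(paths))
    )
```

With the list from the question, the call would be `df = read_xml_files(spark, s3_paths)`. If the comma-separated form is not accepted in your setup, reading each file separately and combining the results with `DataFrame.unionByName` is a slower fallback.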
