Spark.read(): reading multiple paths at once instead of one by one in a for loop

Question

Nik*_*kSp 4 python apache-spark pyspark azure-data-lake databricks

I am running the code below.

list_of_paths is a list of paths, each ending in an avro file. For example:

['folder_1/folder_2/0/2020/05/15/10/41/08.avro', 'folder_1/folder_2/0/2020/05/15/11/41/08.avro', 'folder_1/folder_2/0/2020/05/15/12/41/08.avro']

Note: the paths above are stored in Azure Data Lake storage, and the process below is executed in Databricks.

import time

spark.conf.set("fs.azure.account.key.{0}.dfs.core.windows.net".format(storage_account_name), storage_account_key)
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
begin_time = time.time()

for path in list_of_paths:
    try:
        # Reset the results so a failed read does not leave stale data behind
        read_avro_data, avro_decoded = None, None

        # Read one path per iteration from Azure Data Lake over "abfss"
        read_avro_data = spark.read.format("avro").load(
            "abfss://{0}@{1}.dfs.core.windows.net/{2}".format(
                storage_container_name, storage_account_name, path))

    except Exception as e:
        custom_log(e)

Schema

read_avro_data.printSchema()

root
 |-- SequenceNumber: long (nullable = true)
 |-- Offset: string (nullable = true)
 |-- EnqueuedTimeUtc: string (nullable = true)
 |-- SystemProperties: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- member0: long (nullable = true)
 |    |    |-- member1: double (nullable = true)
 |    |    |-- member2: string (nullable = true)
 |    |    |-- member3: binary (nullable = true)
 |-- Properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- member0: long (nullable = true)
 |    |    |-- member1: double (nullable = true)
 |    |    |-- member2: string (nullable = true)
 |    |    |-- member3: binary (nullable = true)
 |-- Body: binary (nullable = true) 
# this is the content of the AVRO file.

Number of rows and columns

print("ROWS: ", read_avro_data.count(), ", NUMBER OF COLUMNS: ", len(read_avro_data.columns))

ROWS:  2 , NUMBER OF COLUMNS:  6

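Right now each iteration reads 1 AVRO file, that is, 2 rows of content per iteration. Instead, I want to read all the AVRO files at once, so that my final Spark DataFrame has 2x3 = 6 rows of content.

Is this feasible with spark.read()? Something like the following: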

spark.read.format("avro").load("abfss://{0}@{1}.dfs.core.windows.net/folder_1/folder_2/0/2020/05/15/*")

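[Update] Sorry for the misunderstanding about the wildcard (*). It implies that all AVRO files are in the same folder, whereas I have 1 folder per AVRO file: 3 AVRO files, 3 folders. In that case the wildcard would not work. The solution, as answered below, is to use a list [] of path names.

Thanks in advance for your help and suggestions.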

Answer 1

Sri*_*vas 7

The method load(path=None, format=None, schema=None, **options) accepts either a single path or a list of paths.

For example, you can pass a list of paths directly, as shown below:

spark.read.format("avro").load(["/tmp/dataa/userdata1.avro","/tmp/dataa/userdata2.avro"]).count()

1998
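
Applied to the paths from the question, a minimal sketch might look like the following. The helper name `build_abfss_uris` and the placeholder container and account values are assumptions made for illustration. Only the URI construction runs here; the Spark call is left as a comment because it needs a live `spark` session, such as the one a Databricks notebook provides.

```python
def build_abfss_uris(container, account, relative_paths):
    """Turn container-relative paths into full abfss:// URIs.

    Hypothetical helper, not part of Spark or Databricks.
    """
    base = "abfss://{0}@{1}.dfs.core.windows.net/".format(container, account)
    return [base + p.lstrip("/") for p in relative_paths]


list_of_paths = [
    "folder_1/folder_2/0/2020/05/15/10/41/08.avro",
    "folder_1/folder_2/0/2020/05/15/11/41/08.avro",
    "folder_1/folder_2/0/2020/05/15/12/41/08.avro",
]

# "my-container" and "myaccount" are placeholder values.
uris = build_abfss_uris("my-container", "myaccount", list_of_paths)
for uri in uris:
    print(uri)

# With a live session, the whole list goes into a single load() call,
# replacing the for loop from the question:
# read_avro_data = spark.read.format("avro").load(uris)
```

Building the list up front keeps the path formatting in one place, so the read itself stays a single call.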

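You can also use a wildcard such as "*"; Spark will then read all the matching avro files in parallel automatically, so there should be no performance problem: `spark.read.format('avro').load('python/test_support/sql/*')` (2 upvotes)
It is better to pass all the paths to Spark, which will load the files in parallel. If you use foreach, it will load the files sequentially. (2 upvotes)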
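
On the wildcard point: Spark resolves load paths with Hadoop-style globs, in which `*` matches within a single path segment. A layout with one folder per file can therefore, as far as I know, be covered by repeating the wildcard once per directory level. The sketch below only checks that idea locally with Python's `glob` module, whose `*` likewise does not cross `/`. The temporary directory and the empty placeholder files are made up for illustration, and the Spark call is shown as a comment because it needs a live `spark` session.

```python
import glob
import os
import tempfile

# Hypothetical local copy of the layout from the question: one folder per file.
relative_paths = [
    "folder_1/folder_2/0/2020/05/15/10/41/08.avro",
    "folder_1/folder_2/0/2020/05/15/11/41/08.avro",
    "folder_1/folder_2/0/2020/05/15/12/41/08.avro",
]

root = tempfile.mkdtemp()
for rel in relative_paths:
    full = os.path.join(root, *rel.split("/"))
    os.makedirs(os.path.dirname(full), exist_ok=True)
    open(full, "w").close()  # empty placeholder, not a real Avro file

# One "*" per directory level below the day folder: hour / minute / file.
day_pattern = "folder_1/folder_2/0/2020/05/15/*/*/*.avro"
# A single "*" only looks one level down, where there are folders but no files.
single_star = "folder_1/folder_2/0/2020/05/15/*.avro"

matched = sorted(glob.glob(os.path.join(root, *day_pattern.split("/"))))
unmatched = glob.glob(os.path.join(root, *single_star.split("/")))

print(len(matched))    # number of files found by the multi-level pattern
print(len(unmatched))  # number of files found by the single-level pattern

# With a live session (assumption: same variables as in the question):
# spark.read.format("avro").load(
#     "abfss://{0}@{1}.dfs.core.windows.net/{2}".format(
#         storage_container_name, storage_account_name, day_pattern))
```

An explicit list of paths remains the safer choice when only some of the folders under a day should be read, since a glob picks up everything that matches.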
