And*_*man · apache-spark, apache-spark-sql, pyspark
Consider the following DataFrame. I want to merge the array of maps into a single map, without using a UDF.
+---+------------------------------------+
|id |greek |
+---+------------------------------------+
|1 |[{alpha -> beta}, {gamma -> delta}] |
|2 |[{epsilon -> zeta}, {etha -> theta}]|
+---+------------------------------------+
I think I've tried every map function in the pyspark 3 documentation. I thought `map_from_entries` would do it, but it just throws an exception saying it wants a map rather than an array of maps?
Even though I know this would be easy to do with a UDF, I find it hard to believe there is no simpler way.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, [{"alpha": "beta"}, {"gamma": "delta"}]),
        (2, [{"epsilon": "zeta"}, {"etha": "theta"}]),
    ],
    schema=["id", "greek"],
)
Another version, using higher-order functions:
from pyspark.sql import functions as F

# Infer the map type from the first array element so the empty
# accumulator map can be cast to the same type.
map_schema = df.selectExpr("greek[0]").dtypes[0][1]
expr = "REDUCE(greek, cast(map() as {schema}), (acc, el) -> map_concat(acc, el))".format(schema=map_schema)
df = df.withColumn("Concated", F.expr(expr))
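The `REDUCE` expression is just a left fold: start from an empty map and `map_concat` each element of the array onto the accumulator. The same fold can be sketched in plain Python (with hypothetical sample rows mirroring the `greek` column) to see what the expression computes per row:

```python
from functools import reduce

# Hypothetical rows mirroring the `greek` column of the DataFrame.
rows = [
    [{"alpha": "beta"}, {"gamma": "delta"}],
    [{"epsilon": "zeta"}, {"etha": "theta"}],
]

# For each row, fold the array of maps into one map, analogous to
# REDUCE(greek, map(), (acc, el) -> map_concat(acc, el)).
merged = [reduce(lambda acc, el: {**acc, **el}, maps, {}) for maps in rows]
print(merged)
# → [{'alpha': 'beta', 'gamma': 'delta'}, {'epsilon': 'zeta', 'etha': 'theta'}]
```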
Output:
+---+------------------------------------+--------------------------------+
|id |greek |Concated |
+---+------------------------------------+--------------------------------+
|1 |[{alpha -> beta}, {gamma -> delta}] |{alpha -> beta, gamma -> delta} |
|2 |[{epsilon -> zeta}, {etha -> theta}]|{epsilon -> zeta, etha -> theta}|
+---+------------------------------------+--------------------------------+
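One caveat with `map_concat` (and therefore with the `REDUCE` expression above): in Spark 3 the default duplicate-key policy is `EXCEPTION`, so the fold fails if the same key appears in more than one map of the array. If later values should win instead, the session can be configured accordingly (a configuration sketch, assuming a running `spark` session):

```python
# Default spark.sql.mapKeyDedupPolicy is EXCEPTION: map_concat fails
# on duplicate keys. LAST_WIN keeps the last value seen instead.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
```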
Views: 3255