Vam*_*ala 2 python apache-spark pyspark
My current Spark dataframe has CSV values at the cell level in one column, and I am trying to explode them into new columns. Sample dataframe:
a_id features
1 2020 "a","b","c","d","constant1","1","0.1","aa"
2 2021 "a","b","c","d","constant2","1","0.2","ab"
3 2022 "a","b","c","d","constant3","1","0.3","ac","a","b","c","d","constant3","1.1","3.3","acx"
4 2023 "a","b","c","d","constant4","1","0.4","ad"
5 2024 "a","b","c","d","constant5","1","0.5","ae","a","b","c","d","constant5","1.2","6.3","xwy","a","b","c","d","constant5","2.2","8.3","bunr"
6 2025 "a","b","c","d","constant6","1","0.6","af"
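The features column holds multiple CSV values in which a, b, c and d act as headers. The headers repeat in some cells (rows 3 and 5), and I want to extract only one set of headers together with their respective values. The expected output dataframe is shown below.
Output Spark dataframe: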
a_id a d
1 2020 constant1 ["aa"]
2 2021 constant2 ["ab"]
3 2022 constant3 ["ac","acx"]
4 2023 constant4 ["ad"]
5 2024 constant5 ["ae","xwy","bunr"]
6 2025 constant6 ["af"]
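As shown, I only want to extract the a and d headers as new columns, where a is a constant and d has multiple values that are collected into a list.
Please help with how to do this transformation in PySpark. The dataframe above is a real-time streaming dataframe.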
Using only PySpark/Spark SQL functions:
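1. Remove the repeated header from the features column with replace.
2. Use regexp_extract_all to cut the remaining string into groups of num_headers values.
3. explode the result and remove the empty rows.
4. split the result again. Now each CSV value is an element of an array.
5. Take a and d from the first and fourth elements of the array.
6. Group by a_id.

from pyspark.sql import functions as F

# Header that repeats inside each cell (trailing comma included); it has 4 commas
header='"a","b","c","d",'
num_headers = header.count(",")
(
    df
    # Steps 1-2: strip the header, then group the values (regexp_extract_all needs Spark 3.1+)
    .withColumn("features", F.expr(f"replace(features, '{header}')"))
    .withColumn("features", F.expr(f"regexp_extract_all(features, '(([^,]*,?)\\{{{num_headers}}})')"))
    # Step 3: one row per group; the regex also yields an empty match, which is dropped
    .withColumn("features", F.explode("features"))
    .filter("not features = ''")
    # Steps 4-5: split each group; a is the first value, d the fourth
    .withColumn("features", F.split("features", ","))
    .withColumn("a", F.expr("features[0]"))
    .withColumn("d", F.expr("features[3]"))
    # Step 6: one row per a_id, with all d values collected into a list
    .groupBy("a_id")
    .agg(F.first("a").alias("a"), F.collect_list("d").alias("d"))
    .show(truncate=False)
)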
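The stacked braces in the regexp_extract_all line are easy to misread. As a small check in plain Python (no Spark needed), the f-string can be rendered on its own to see the SQL text that is handed to F.expr. The name rendered_sql is only illustrative and does not appear in the answer.

```python
header = '"a","b","c","d",'
num_headers = header.count(",")  # counts the 4 commas in the header

# Same f-string as in the answer: "\\" is a single backslash, "{{" and "}}"
# are literal braces, and {num_headers} is the only substituted field.
rendered_sql = f"regexp_extract_all(features, '(([^,]*,?)\\{{{num_headers}}})')"
print(rendered_sql)
```

This prints regexp_extract_all(features, '(([^,]*,?)\{4})'), so the regex asks for groups of four comma-separated values. As far as I can tell, Spark's SQL parser with default settings drops the backslash in front of the brace, so the regex engine ends up seeing a plain {4} quantifier.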
Output:
+----+----------+---------------------+
|a_id|a         |d                    |
+----+----------+---------------------+
|2020|"constant"|["aa"]               |
|2022|"constant"|["ac", "acx"]        |
|2025|"constant"|["af"]               |
|2023|"constant"|["ad"]               |
|2021|"constant"|["ab"]               |
|2024|"constant"|["ae", "xwy", "bunr"]|
+----+----------+---------------------+
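To see why the values keep their double quotes and why the empty-row filter is needed, the same string steps can be traced in plain Python on a single cell. This is only a sketch of the logic using Python's re module, not Spark itself; extract_a_and_d is a hypothetical helper name, and the sample string is row 3 from the question.

```python
import re

HEADER = '"a","b","c","d",'
NUM_VALUES = HEADER.count(",")  # 4 values follow each header


def extract_a_and_d(features: str):
    """Plain-Python imitation of the replace / regex / split steps."""
    # Drop every repetition of the header.
    body = features.replace(HEADER, "")
    # Cut the rest into groups of NUM_VALUES comma-separated values.
    pattern = r"((?:[^,]*,?){%d})" % NUM_VALUES
    groups = re.findall(pattern, body)
    # The pattern also matches the empty string at the end of the input,
    # which is why the Spark version filters out empty rows.
    groups = [g for g in groups if g != ""]
    # Split each group into its values; index 0 is a, index 3 is d.
    rows = [g.split(",") for g in groups]
    a = rows[0][0]
    d = [row[3] for row in rows]
    return a, d


row3 = ('"a","b","c","d","constant3","1","0.3","ac",'
        '"a","b","c","d","constant3","1.1","3.3","acx"')
a, d = extract_a_and_d(row3)
print(a, d)  # "constant3" ['"ac"', '"acx"']
```

The double quotes are part of each value because nothing in these steps strips them. If unquoted values are wanted, they would have to be removed in an extra step, which the answer above does not do.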