Dar*_*ero 11 apache-spark-sql pyspark
I have:
key value
a [1,2,3]
b [2,3,4]
I want:
key value1 value2 value3
a 1 2 3
b 2 3 4
It seems that in Scala I can write df.select($"value._1", $"value._2", $"value._3"), but that doesn't appear to be possible in Python.
Is there a good way to do this?
MaF*_*aFF 34
It depends on the type of your "list":

If it is of type ArrayType():
df = hc.createDataFrame(sc.parallelize([['a', [1,2,3]], ['b', [2,3,4]]]), ["key", "value"])
df.printSchema()
df.show()
root
 |-- key: string (nullable = true)
 |-- value: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-------+
|key|  value|
+---+-------+
|  a|[1,2,3]|
|  b|[2,3,4]|
+---+-------+
You can access the values like you would with a Python list, using []:
df.select("key", df.value[0], df.value[1], df.value[2]).show()
+---+--------+--------+--------+
|key|value[0]|value[1]|value[2]|
+---+--------+--------+--------+
| a| 1| 2| 3|
| b| 2| 3| 4|
+---+--------+--------+--------+
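As a side note, the same select can be written with getItem() and aliases, which also makes it easy to generate the columns in a loop. A minimal sketch (the value1..value3 names are my own choice, not from the answer above):

from pyspark.sql import functions as F

# Generate one aliased column per array index; 3 is assumed known here.
df.select(
    "key",
    *[F.col("value").getItem(i).alias("value" + str(i + 1)) for i in range(3)]
).show()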
If it is of type StructType() (perhaps you built your DataFrame by reading JSON):
import pyspark.sql.functions as psf

df2 = df.select("key", psf.struct(
        df.value[0].alias("value1"),
        df.value[1].alias("value2"),
        df.value[2].alias("value3")
    ).alias("value"))
df2.printSchema()
df2.show()
root
 |-- key: string (nullable = true)
 |-- value: struct (nullable = false)
 |    |-- value1: long (nullable = true)
 |    |-- value2: long (nullable = true)
 |    |-- value3: long (nullable = true)
+---+-------+
|key| value|
+---+-------+
| a|[1,2,3]|
| b|[2,3,4]|
+---+-------+
You can then "split" the column directly using *:
df2.select('key', 'value.*').show()
+---+------+------+------+
|key|value1|value2|value3|
+---+------+------+------+
| a| 1| 2| 3|
| b| 2| 3| 4|
+---+------+------+------+
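If you start from the ArrayType column, the two steps above can be condensed into a single chain. A sketch based on the same df (my own condensation of the answer's code, again assuming 3 elements):

import pyspark.sql.functions as psf

# Pack the array elements into a named struct, then unpack it with 'value.*'.
df.select("key", psf.struct(
        *[df.value[i].alias("value" + str(i + 1)) for i in range(3)]
    ).alias("value")) \
  .select("key", "value.*") \
  .show()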
小智 8
I'd like to add to pault's answer the case of lists (arrays) of a known size.
When a column contains medium-sized (or even large) arrays, it is still possible to split them into columns.
from pyspark.sql.types import *           # Needed to define the DataFrame schema.
from pyspark.sql.functions import expr

# Define a schema to create a DataFrame with an array-typed column.
mySchema = StructType([StructField("V1", StringType(), True),
                       StructField("V2", ArrayType(IntegerType(), True))])

df = spark.createDataFrame([['A', [1, 2, 3, 4, 5, 6, 7]],
                            ['B', [8, 7, 6, 5, 4, 3, 2]]], schema=mySchema)

# Split the list into columns using 'expr()' in a list comprehension.
arr_size = 7
df = df.select(['V1', 'V2'] + [expr('V2[' + str(x) + ']') for x in range(0, arr_size)])

# It is possible to define new column names.
new_colnames = ['V1', 'V2'] + ['val_' + str(i) for i in range(0, arr_size)]
df = df.toDF(*new_colnames)
The result is:
df.show(truncate= False)
+---+---------------------+-----+-----+-----+-----+-----+-----+-----+
|V1 |V2 |val_0|val_1|val_2|val_3|val_4|val_5|val_6|
+---+---------------------+-----+-----+-----+-----+-----+-----+-----+
|A |[1, 2, 3, 4, 5, 6, 7]|1 |2 |3 |4 |5 |6 |7 |
|B |[8, 7, 6, 5, 4, 3, 2]|8 |7 |6 |5 |4 |3 |2 |
+---+---------------------+-----+-----+-----+-----+-----+-----+-----+
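The answer above hardcodes arr_size = 7. If the size is not known up front, one option (my own addition, not part of the answer) is to derive it from the data with an aggregation; note that this triggers an extra Spark job:

from pyspark.sql import functions as F

# Length of the longest array in the column; requires a pass over the data.
arr_size = df.agg(F.max(F.size("V2"))).first()[0]

Rows whose arrays are shorter than arr_size will get null in the extra columns, since out-of-bounds array indexing returns null in Spark SQL.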