Splitting complex DataFrame rows into simple rows in PySpark

Question

K.A*_*Ali 5 python dataframe apache-spark apache-spark-sql pyspark

I have this code:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext()
sqlContext = SQLContext(sc)
documents = sqlContext.createDataFrame([
    Row(id=1, title=[Row(value=u'cars', max_dist=1000)]),
    Row(id=2, title=[Row(value=u'horse bus',max_dist=50), Row(value=u'normal bus',max_dist=100)]),
    Row(id=3, title=[Row(value=u'Airplane', max_dist=5000)]),
    Row(id=4, title=[Row(value=u'Bicycles', max_dist=20),Row(value=u'Motorbikes', max_dist=80)]),
    Row(id=5, title=[Row(value=u'Trams', max_dist=15)])])

documents.show(truncate=False)
#+---+----------------------------------+
#|id |title                             |
#+---+----------------------------------+
#|1  |[[1000,cars]]                     |
#|2  |[[50,horse bus], [100,normal bus]]|
#|3  |[[5000,Airplane]]                 |
#|4  |[[20,Bicycles], [80,Motorbikes]]  |
#|5  |[[15,Trams]]                      |
#+---+----------------------------------+

I need to split every compound row (for example, ids 2 and 4) into multiple rows while keeping 'id', so that the result looks like this:

#+---+----------------------------------+
#|id |title                             |
#+---+----------------------------------+
#|1  |[1000,cars]                       |
#|2  |[50,horse bus]                    |
#|2  |[100,normal bus]                  |
#|3  |[5000,Airplane]                   |
#|4  |[20,Bicycles]                     |
#|4  |[80,Motorbikes]                   |
#|5  |[15,Trams]                        |
#+---+----------------------------------+

Answer 1

zer*_*323 16

Just use explode:

from pyspark.sql.functions import explode

documents.withColumn("title", explode("title")).show()
## +---+----------------+
## | id|           title|
## +---+----------------+
## |  1|     [1000,cars]|
## |  2|  [50,horse bus]|
## |  2|[100,normal bus]|
## |  3| [5000,Airplane]|
## |  4|   [20,Bicycles]|
## |  4| [80,Motorbikes]|
## |  5|      [15,Trams]|
## +---+----------------+
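
To make the behaviour concrete, here is a plain-Python sketch (no Spark involved) of what explode does to these rows: each element of the array column becomes its own row, and the other columns are copied along. The helper name explode_rows and the dict/tuple representation are assumptions made for this illustration only; they are not part of the PySpark API.

```python
# Plain-Python model of explode semantics; runs without Spark.
# explode_rows is a hypothetical helper for illustration, not a PySpark function.

def explode_rows(rows, column):
    """Yield one output row per element of row[column], copying the other fields."""
    for row in rows:
        # A missing (None) or empty list produces no output rows for that input row.
        for element in row.get(column) or []:
            new_row = dict(row)        # shallow copy keeps 'id' and any other fields
            new_row[column] = element  # replace the list with a single element
            yield new_row


# Same data as in the question, with each struct modelled as a (max_dist, value) tuple.
documents = [
    {"id": 1, "title": [(1000, "cars")]},
    {"id": 2, "title": [(50, "horse bus"), (100, "normal bus")]},
    {"id": 3, "title": [(5000, "Airplane")]},
    {"id": 4, "title": [(20, "Bicycles"), (80, "Motorbikes")]},
    {"id": 5, "title": [(15, "Trams")]},
]

exploded = list(explode_rows(documents, "title"))
for row in exploded:
    print(row["id"], row["title"])
```

As in this sketch, explode in PySpark emits no row when the array is null or empty. If those ids must be kept, explode_outer (available in newer Spark releases) is worth checking, since it emits a row with a null value instead.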
