我有这个代码:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
sc = SparkContext()
sqlContext = SQLContext(sc)
documents = sqlContext.createDataFrame([
Row(id=1, title=[Row(value=u'cars', max_dist=1000)]),
Row(id=2, title=[Row(value=u'horse bus',max_dist=50), Row(value=u'normal bus',max_dist=100)]),
Row(id=3, title=[Row(value=u'Airplane', max_dist=5000)]),
Row(id=4, title=[Row(value=u'Bicycles', max_dist=20),Row(value=u'Motorbikes', max_dist=80)]),
Row(id=5, title=[Row(value=u'Trams', max_dist=15)])])
documents.show(truncate=False)
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[[1000,cars]] |
#|2 |[[50,horse bus], [100,normal bus]]|
#|3 |[[5000,Airplane]] |
#|4 |[[20,Bicycles], [80,Motorbikes]] |
#|5 |[[15,Trams]] |
#+---+----------------------------------+
Run Code Online (Sandbox Code Playgroud)
我需要将所有复合行(例如2和4)拆分为多行,同时保留'id',以获得如下结果:
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[1000,cars] |
#|2 |[50,horse bus] |
#|2 |[100,normal bus] |
#|3 …
Run Code Online (Sandbox Code Playgroud) 我有一个数据框(具有更多行和列),如下所示。
示例 DF:
from pyspark import Row
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode
sqlc = SQLContext(sc)
df = sqlc.createDataFrame([Row(col1 = 'z1', col2 = '[a1, b2, c3]', col3 = 'foo')])
# +------+-------------+------+
# | col1| col2| col3|
# +------+-------------+------+
# | z1| [a1, b2, c3]| foo|
# +------+-------------+------+
df
# DataFrame[col1: string, col2: string, col3: string]
Run Code Online (Sandbox Code Playgroud)
我想要的是:
+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
| z1| a1| foo|
| z1| b2| foo|
| z1| c3| foo|
+-----+-----+-----+
Run Code Online (Sandbox Code Playgroud)
我试图复制RDD
这里提供的解决方案: …