Convert a list of Rows to a PySpark DataFrame

Asked by mbl*_*ume (score 12) · tags: python, rows, apache-spark, apache-spark-sql, pyspark

I have the following list of Rows that I want to convert to a PySpark DataFrame:

data = [Row(id=u'1', probability=0.0, thresh=10, prob_opt=0.45),
        Row(id=u'2', probability=0.4444444444444444, thresh=60, prob_opt=0.45),
        Row(id=u'3', probability=0.0, thresh=10, prob_opt=0.45),
        Row(id=u'80000000808', probability=0.0, thresh=100, prob_opt=0.45)]

I need to convert this into a PySpark DataFrame.

I tried data.toDF() and got:

AttributeError: 'list' object has no attribute 'toDF'

Answer from Zyg*_*ygD (score 12)

This seems to work. A plain Python list has no toDF method, but spark.createDataFrame accepts a list of Rows directly:

spark.createDataFrame(data)

Test results:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

data = [Row(id=u'1', probability=0.0, thresh=10, prob_opt=0.45),
        Row(id=u'2', probability=0.4444444444444444, thresh=60, prob_opt=0.45),
        Row(id=u'3', probability=0.0, thresh=10, prob_opt=0.45),
        Row(id=u'80000000808', probability=0.0, thresh=100, prob_opt=0.45)]

df = spark.createDataFrame(data)
df.show()
#  +-----------+------------------+------+--------+
#  |         id|       probability|thresh|prob_opt|
#  +-----------+------------------+------+--------+
#  |          1|               0.0|    10|    0.45|
#  |          2|0.4444444444444444|    60|    0.45|
#  |          3|               0.0|    10|    0.45|
#  |80000000808|               0.0|   100|    0.45|
#  +-----------+------------------+------+--------+
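If you want to control the column types rather than let Spark infer them, createDataFrame also accepts an explicit schema. A minimal sketch, with the StructField types chosen here as assumptions matching the sample values (note that the schema's field order must line up with the Row field order):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

# Assumed schema matching the sample data; order must match the Row fields
schema = StructType([
    StructField('id', StringType(), True),
    StructField('probability', DoubleType(), True),
    StructField('thresh', LongType(), True),
    StructField('prob_opt', DoubleType(), True),
])

df = spark.createDataFrame(data, schema)
df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- probability: double (nullable = true)
#  |-- thresh: long (nullable = true)
#  |-- prob_opt: double (nullable = true)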


Answer from 小智 (score 5)

You can try the following code:

from pyspark.sql import SparkSession, Row

# `sc` is the SparkContext; outside a shell, get it from a SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(data)
df = rdd.toDF()
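One caveat: toDF() is only attached to RDDs once a SparkSession exists, which is why the session is created first above. Also, because the elements here are Row objects, toDF() picks up the field names automatically; if the rows were plain tuples, the columns would come out as _1, _2, and so on, and toDF accepts explicit names instead. A small sketch with made-up tuple values:

# Tuples carry no field names, so name the columns explicitly
tuples = sc.parallelize([('1', 0.0, 10, 0.45),
                         ('2', 0.4444444444444444, 60, 0.45)])
df = tuples.toDF(['id', 'probability', 'thresh', 'prob_opt'])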


Answer from mbl*_*ume (score 1)

Found the answer!

rdd = sc.parallelize(data)

# Column names are passed explicitly; the types are still inferred
df = spark.createDataFrame(rdd, ['id', 'probability', 'thresh', 'prob_opt'])
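The detour through an RDD is not actually required, though: as in the accepted answer, createDataFrame takes the list directly, and the column-name list can be passed the same way.

# Equivalent without parallelize; the names replace the Row field names positionally
df = spark.createDataFrame(data, ['id', 'probability', 'thresh', 'prob_opt'])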