How do I create an empty DataFrame? Why "ValueError: RDD is empty"?

use*_*768 19 apache-spark pyspark

I'm trying to create an empty DataFrame in Spark (PySpark).

I'm using an approach similar to the one discussed here, but it doesn't work.

Here is my code:

df = sqlContext.createDataFrame(sc.emptyRDD(), schema)

This is the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 285, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 229, in _inferSchema
    first = rdd.first()
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1320, in first
    raise ValueError("RDD is empty")
ValueError: RDD is empty

Ton*_*res 31

Extending Joe Widen's answer, you can actually create a schema with no fields:

schema = StructType([])

So when you create a DataFrame with that as your schema, you end up with a DataFrame[].

>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
>>> empty
DataFrame[]
>>> empty.schema
StructType(List())

In Scala, if you use sqlContext.emptyDataFrame and check the schema, it returns StructType().

scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []

scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()    

  • Same result with the command spark.createDataFrame([[]]) (2 upvotes)

Joe*_*den 10

At the time of writing this answer, it looks like you need some kind of schema:

from pyspark.sql.types import *
field = [StructField("field1", StringType(), True)]
schema = StructType(field)

sqlContext.createDataFrame(sc.emptyRDD(), schema)
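As for the "why" in the question: the traceback fails inside `_inferSchema`, at `rdd.first()`, so schema inference was running. That suggests the `schema` that was passed was not treated as a complete schema, although the question does not show how it was defined. Below is a pure-Python toy model of that call path, not PySpark's actual code. `ToyRDD`, `infer_schema` and `create_data_frame` are hypothetical names made up for this illustration.

```python
# Toy model of the call path seen in the traceback; NOT PySpark's real code.

class ToyRDD:
    """Hypothetical stand-in for an RDD, backed by a plain list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def first(self):
        # Same failure mode as in the traceback: there is no first row.
        if not self.rows:
            raise ValueError("RDD is empty")
        return self.rows[0]


def infer_schema(rdd):
    # Inference must look at a row to learn column names and types.
    first = rdd.first()
    return [(name, type(value).__name__) for name, value in first.items()]


def create_data_frame(rdd, schema=None):
    # With an explicit schema, no row is ever read from the data.
    if schema is None:
        schema = infer_schema(rdd)
    return {"schema": schema, "rows": rdd.rows}


empty = ToyRDD([])

# Explicit schema: works even though there are no rows.
with_schema = create_data_frame(empty, schema=[("field1", "str")])

# No schema: inference asks for the first row and fails.
try:
    create_data_frame(empty)
    error = None
except ValueError as exc:
    error = str(exc)
```

In this sketch the empty input only becomes a problem when the schema has to be inferred, which matches the idea above that some kind of explicit schema is needed.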


bra*_*raj 7

This works with Spark 2.0.0 or later:

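from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = spark.sparkContext      # `spark` is the existing SparkSession
sqlContext = SQLContext(sc)  # build the SQLContext that the import is for
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)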