I want to create test data in a PySpark DataFrame, but I keep getting the same "tuple index out of range" error. I don't get this error when reading a CSV. I'd appreciate any ideas on why this is happening.
The first thing I tried was creating a pandas DataFrame and converting it to a PySpark DataFrame:
import pandas as pd
from pyspark.sql import SparkSession

# create or reuse a SparkSession (my snippets below assume one named `spark`)
spark = SparkSession.builder.getOrCreate()

columns = ["id", "col_"]
data = [("1", "blue"), ("2", "green"),
        ("3", "purple"), ("4", "red"),
        ("5", "yellow")]
df = pd.DataFrame(data=data, columns=columns)
sparkdf = spark.createDataFrame(df)
sparkdf.show()
Output:
PicklingError: Could not serialize object: IndexError: tuple index out of range
If I instead create the DataFrame from an RDD, following the SparkByExamples.com instructions, I get the same error:
rdd = spark.sparkContext.parallelize(data)
sparkdf = spark.createDataFrame(rdd).toDF(*columns)
sparkdf.show()
I also tried the following and got the same error:
import pyspark.pandas as ps
df1 = ps.from_pandas(df)
Here is the full error from running the code above:
IndexError Traceback (most recent call last)
File c:\Users\jonat\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\serializers.py:458, in CloudPickleSerializer.dumps(self, obj)
457 try:
--> 458 …
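In case the environment matters (the traceback path shows Python 3.11 and the failure comes from CloudPickleSerializer), here is a minimal snippet for checking which interpreter and PySpark versions are in play; I'm not including my output, just the check itself:

import sys
import pyspark

# print the Python interpreter version and the installed PySpark version
print(sys.version)
print(pyspark.__version__)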