小编has*_*vam的帖子

AttributeError: 'DataFrame' 对象没有属性 '_data'

在 Pandas 数据帧上并行化时 Azure Databricks 执行错误。代码能够创建RDD但在执行时中断.collect()

设置:

import pandas as pd
# initialize list of lists 
data = [['tom', 10], ['nick', 15], ['juli', 14]] 
  
# Create the pandas DataFrame 
my_df = pd.DataFrame(data, columns = ['Name', 'Age']) 

def testfn(i):
  return my_df.iloc[i]
test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
print (test_var)
Run Code Online (Sandbox Code Playgroud)

错误:

Py4JJavaError                             Traceback (most recent call last)
<command-2941072546245585> in <module>
      1 def testfn(i):
      2   return my_df.iloc[i]
----> 3 test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
      4 print (test_var)

/databricks/spark/python/pyspark/rdd.py in collect(self)
    901         # Default path used in OSS Spark / for non-credential …
Run Code Online (Sandbox Code Playgroud)

python apache-spark pyspark databricks azure-databricks

4
推荐指数
1
解决办法
3432
查看次数