Azure Databricks execution error when parallelizing over a pandas DataFrame. The code creates the RDD, but it breaks when .collect() runs.
Setup:
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
# Create the pandas DataFrame
my_df = pd.DataFrame(data, columns = ['Name', 'Age'])
def testfn(i):
    # Look up row i of the driver-side pandas DataFrame
    return my_df.iloc[i]
test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
print (test_var)
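(For anyone trying this outside a Databricks notebook, here is a minimal standalone sketch of the same pattern, assuming a local PySpark install. The explicit SparkSession, the local[*] master and the app name are my additions, since sc is only pre-created inside Databricks; whether it reproduces the Py4JJavaError below may depend on the cluster configuration.)

import pandas as pd
from pyspark.sql import SparkSession

# Create a local Spark context; on Databricks, sc is already provided
spark = SparkSession.builder.master("local[*]").appName("pandas-closure-repro").getOrCreate()
sc = spark.sparkContext

data = [['tom', 10], ['nick', 15], ['juli', 14]]
my_df = pd.DataFrame(data, columns=['Name', 'Age'])

def testfn(i):
    # my_df is captured in the task closure and pickled out to the executors
    return my_df.iloc[i]

# Same call chain as in the question: 3 indices spread over 50 partitions
test_var = sc.parallelize([0, 1, 2], 50).map(testfn).collect()
print(test_var)

spark.stop()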
Error:
Py4JJavaError Traceback (most recent call last)
<command-2941072546245585> in <module>
      1 def testfn(i):
      2     return my_df.iloc[i]
----> 3 test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
      4 print (test_var)
/databricks/spark/python/pyspark/rdd.py in collect(self)
    901 # Default path used in OSS Spark / for non-credential …