Andrey 3 python pandas pyspark
I want to convert several large Pandas DataFrames to Spark DataFrames, then manipulate and merge them, like this:
import pandas as pd
from pyspark import SparkContext, SQLContext

df1 = pd.read_csv('data1.cat', delim_whitespace=True)
df2 = pd.read_csv('data2.cat', delim_whitespace=True)

sc = SparkContext()
sql = SQLContext(sc)
spark_df1 = sql.createDataFrame(df1)
spark_df2 = sql.createDataFrame(df2)
But something goes wrong, and I get the following error:
File "/home/user/anaconda3/envs/conda_py3.6.8/lib/python3.6/site-packages/pyspark/sql/context.py", line 307, in createDataFrame
return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
File "/home/user/anaconda3/envs/conda_py3.6.8/lib/python3.6/site-packages/pyspark/sql/session.py", line 724, in createDataFrame
data = self._convert_from_pandas(data, schema, timezone)
File "/home/user/anaconda3/envs/conda_py3.6.8/lib/python3.6/site-packages/pyspark/sql/session.py", line 487, in _convert_from_pandas
np_records = pdf.to_records(index=False)
File "/home/user/anaconda3/envs/conda_py3.6.8/lib/python3.6/site-packages/pandas/core/frame.py", line 1839, in to_records
return np.rec.fromarrays(arrays, dtype={"names": names, "formats": formats})
File "/home/user/.local/lib/python3.6/site-packages/numpy/core/records.py", line 617, in fromarrays
descr = sb.dtype(dtype)
ValueError: name already used as a name or title
Is it possible to create multiple Spark DataFrames in the same session like this?
小智 14
The error is thrown because two or more columns of one of your pandas DataFrames (df1 or df2) share the same column name. Rename the duplicate column or drop it.
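As a minimal sketch of how to check for this (using a made-up DataFrame, not your actual data files), you can find the duplicated names with `df.columns.duplicated()` and make them unique before calling `createDataFrame`:

```python
import pandas as pd

# Hypothetical DataFrame with a duplicated column name "a"
df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

# Detect which column names are duplicated
dupes = df.columns[df.columns.duplicated()].tolist()
print(dupes)  # ['a']

# One way to fix it: suffix the duplicated names with their position
# so every column name is unique
df.columns = [f"{name}_{i}" if name in dupes else name
              for i, name in enumerate(df.columns)]
print(df.columns.tolist())  # ['a_0', 'a_1', 'b']
```

After the rename, `sql.createDataFrame(df)` no longer hits the `ValueError` from `to_records`, since every column maps to a distinct field name.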
Thanks, Sagar