pyspark: creating more than one DataFrame fails

And*_*rey · 3 · python pandas pyspark

I want to convert several large pandas DataFrames into Spark DataFrames, then manipulate and merge them, like this:

import pandas as pd
from pyspark import SparkContext,SQLContext

df1 = pd.read_csv('data1.cat',delim_whitespace=True)
df2 = pd.read_csv('data2.cat',delim_whitespace=True)

sc = SparkContext()
sql = SQLContext(sc)
spark_df1 = sql.createDataFrame(df1)
spark_df2 = sql.createDataFrame(df2)

But something goes wrong, and I get the following error:

  File "/home/user/anaconda3/envs/conda_py3.6.8/lib/python3.6/site-packages/pyspark/sql/context.py", line 307, in createDataFrame
    return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
  File "/home/user/anaconda3/envs/conda_py3.6.8/lib/python3.6/site-packages/pyspark/sql/session.py", line 724, in createDataFrame
    data = self._convert_from_pandas(data, schema, timezone)
  File "/home/user/anaconda3/envs/conda_py3.6.8/lib/python3.6/site-packages/pyspark/sql/session.py", line 487, in _convert_from_pandas
    np_records = pdf.to_records(index=False)
  File "/home/user/anaconda3/envs/conda_py3.6.8/lib/python3.6/site-packages/pandas/core/frame.py", line 1839, in to_records
    return np.rec.fromarrays(arrays, dtype={"names": names, "formats": formats})
  File "/home/user/.local/lib/python3.6/site-packages/numpy/core/records.py", line 617, in fromarrays
    descr = sb.dtype(dtype)
ValueError: name already used as a name or title

Is it possible to create multiple Spark DataFrames in the same session like this?

小智 · 14

That error is thrown because two or more columns of your pandas DataFrame (df1 or df2) have the same column name. Rename the duplicate columns or drop them.

Thanks, Sagar
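To illustrate the fix: since the column names in data1.cat are not shown in the question, here is a minimal sketch with a toy frame whose column name is deliberately duplicated. It finds the repeated names and makes them unique by appending a positional suffix, after which `createDataFrame` would no longer hit the `ValueError`:

```python
import pandas as pd

# Toy frame reproducing the failure mode: two columns both named "value".
df1 = pd.DataFrame([[1, 2, 3]], columns=["id", "value", "value"])

# Find which names are duplicated.
dupes = df1.columns[df1.columns.duplicated()].unique().tolist()
print(dupes)  # ['value']

# One way to make names unique: suffix repeats with their occurrence count.
seen = {}
new_cols = []
for name in df1.columns:
    if name in seen:
        seen[name] += 1
        new_cols.append(f"{name}_{seen[name]}")
    else:
        seen[name] = 0
        new_cols.append(name)
df1.columns = new_cols
print(list(df1.columns))  # ['id', 'value', 'value_1']
```

With unique names in place, `sql.createDataFrame(df1)` should succeed, because pandas' `to_records` no longer sees a repeated field name.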

  • Hi Sagar, welcome to Stack Overflow! I suggest you take a look at [how to write a good answer](https://stackoverflow.com/help/how-to-answer) so you can make sure most people benefit from it! (3 upvotes)