Post by And*_*rey

pyspark: creating more than one dataframe fails

I want to convert several large Pandas dataframes into Spark dataframes, then manipulate and merge them, like this:

import pandas as pd
from pyspark import SparkContext, SQLContext

# Read the whitespace-delimited catalogue files into pandas first
df1 = pd.read_csv('data1.cat', delim_whitespace=True)
df2 = pd.read_csv('data2.cat', delim_whitespace=True)

sc = SparkContext()
sql = SQLContext(sc)

# Convert both pandas dataframes to Spark dataframes
spark_df1 = sql.createDataFrame(df1)
spark_df2 = sql.createDataFrame(df2)

But something goes wrong and I get the following error:

  File "/home/user/anaconda3/envs/conda_py3.6.8/lib/python3.6/site-packages/pyspark/sql/context.py", line 307, in createDataFrame
    return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
  File "/home/user/anaconda3/envs/conda_py3.6.8/lib/python3.6/site-packages/pyspark/sql/session.py", line 724, in createDataFrame
    data = self._convert_from_pandas(data, schema, timezone)
  File "/home/user/anaconda3/envs/conda_py3.6.8/lib/python3.6/site-packages/pyspark/sql/session.py", line 487, in _convert_from_pandas
    np_records = pdf.to_records(index=False)
  File "/home/user/anaconda3/envs/conda_py3.6.8/lib/python3.6/site-packages/pandas/core/frame.py", line 1839, in to_records
    return np.rec.fromarrays(arrays, dtype={"names": names, "formats": formats})
  File "/home/user/.local/lib/python3.6/site-packages/numpy/core/records.py", line 617, in …
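The traceback shows the failure happens inside pandas itself: PySpark's `_convert_from_pandas` calls `pdf.to_records(index=False)` before handing the data to the JVM. A minimal sketch to exercise that step in isolation, using a synthetic frame standing in for `data1.cat` (the real column names are an assumption here), would be:

```python
import pandas as pd

# Hypothetical stand-in for the data read from data1.cat
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

# This is the exact call that fails inside PySpark's _convert_from_pandas.
# If it raises here too, the problem lies in the pandas/numpy combination,
# not in Spark itself.
records = df1.to_records(index=False)
print(records.dtype.names)  # → ('a', 'b')
```

Running this on the actual dataframes (instead of the synthetic one) would narrow down whether the error is reproducible without Spark.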

python pandas pyspark

3 votes · 1 answer · 10k views
