I want to convert a large Spark DataFrame with more than 1,000,000 rows to Pandas. I tried converting the Spark DataFrame to a Pandas DataFrame using the following code:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
result.toPandas()
However, I get the following error:
TypeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
1949 import pyarrow
-> 1950 to_arrow_schema(self.schema)
1951 tables = self._collectAsArrow()
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_schema(schema)
1650 fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651 for field in schema]
1652 return pa.schema(fields)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in <listcomp>(.0)
1650 fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651 for field in schema]
1652 return pa.schema(fields)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_type(dt)
1641 else:
-> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
1643 return arrow_type
TypeError: Unsupported type in conversion to Arrow: VectorUDT
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-138-4e12457ff4d5> in <module>()
1 spark.conf.set("spark.sql.execution.arrow.enabled", "true")
----> 2 result.toPandas()
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
1962 "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
1963 "to disable this.")
-> 1964 raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
1965 else:
1966 pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
RuntimeError: Unsupported type in conversion to Arrow: VectorUDT
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
So it doesn't work, but if I set Arrow to false, it does work. It's just far too slow... Any ideas?
Arrow supports only a small set of types, and Spark UserDefinedTypes, including the ml and mllib VectorUDTs, are not among them.
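To see which columns will trip up the Arrow path, you can inspect the schema for UDT-backed fields. A minimal sketch, using the result DataFrame from your question:

from pyspark.sql.types import UserDefinedType

# Columns whose data type is a UserDefinedType (e.g. ml/mllib VectorUDT) are the
# ones Arrow cannot convert.
udt_columns = [f.name for f in result.schema.fields
               if isinstance(f.dataType, UserDefinedType)]
print(udt_columns)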
If you want to use Arrow, you have to convert the data to a supported format first. One possible solution is to expand the Vectors into columns - see How to split Vector into columns - using PySpark.
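For example, on Spark 3.0+ the expansion can be done with pyspark.ml.functions.vector_to_array. This is only a sketch: the column name "features" and the vector length are placeholders, and on older Spark versions you would use one of the approaches from the linked question instead.

from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

n = 3  # length of the vectors in your column
expanded = (result
            .withColumn("features_arr", vector_to_array("features"))
            .select(*[c for c in result.columns if c != "features"],
                    *[col("features_arr")[i].alias(f"features_{i}") for i in range(n)]))

# All remaining columns are primitive types, so the Arrow-based conversion works.
expanded.toPandas()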
You can also serialize the output using the to_json method:
from pyspark.sql.functions import to_json

# Replace the vector column with its JSON string representation before calling toPandas()
df = df.withColumn("your_vector_column", to_json("your_vector_column"))
But if the data is large enough that toPandas becomes a serious bottleneck, I would reconsider collecting it this way at all.
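If the result really does have to end up on the driver as a Pandas DataFrame, one common alternative is to write it out with Spark and read the files back with pandas instead of going through toPandas. A rough sketch, assuming the vector column has already been expanded or serialized as above; the path is hypothetical:

import pandas as pd

# Write the (Arrow-friendly) DataFrame as Parquet with Spark...
expanded.write.mode("overwrite").parquet("/tmp/result_parquet")

# ...then read it back on the driver with pandas (pyarrow handles the directory of files).
pdf = pd.read_parquet("/tmp/result_parquet")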