Is it possible to store a numpy array in a Spark Dataframe column?

Question

Tha*_*gor 7 numpy pyspark spark-dataframe

I have a dataframe to which I apply a function. The function returns a numpy array, as in the following code:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

create_vector_udf = udf(create_vector, ArrayType(FloatType()))
dataframe = dataframe.withColumn('vector', create_vector_udf('text'))
dataframe.select('lang', 'url', 'vector').show(20)

Spark does not seem happy with this and does not accept ArrayType(FloatType()). I get the following error message: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

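I could just return numpyarray.tolist() to get a list version of it, but then I would obviously have to recreate the array every time I want to use it with numpy.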
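
For reference, the round trip in question is roughly the following. This is a minimal sketch, assuming the vector is a flat float32 array; the values are made up, and `stored` stands in for what an ArrayType(FloatType()) column would hold.

```python
import numpy as np

# Made-up vector standing in for the output of create_vector.
vector = np.array([0.5, 1.5, 2.0], dtype=np.float32)

# What would go into the column: a plain Python list of built-in floats.
stored = vector.tolist()

# Rebuilding the array when NumPy is needed again.
restored = np.asarray(stored, dtype=np.float32)
```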

So is there a way to store a numpy array in a dataframe column?

Answer 1

pis*_*all 1

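The source of the problem is that the object returned from the UDF doesn't conform to the declared type. create_vector evidently not only returns a numpy.ndarray but also converts the numbers to the corresponding NumPy scalar types, which are not compatible with the DataFrame API.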
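
The mismatch can be seen without Spark. A small sketch, assuming create_vector returns a float32 array (its body is not shown in the question, so the array here is a made-up stand-in):

```python
import numpy as np

# Made-up stand-in for the output of create_vector.
vector = np.array([1.0, 2.0, 3.0], dtype=np.float32)

# Elements of an ndarray are NumPy scalars, not built-in Python floats.
element_type = type(vector[0])        # numpy.float32

# tolist() converts both the container and its elements to built-in types.
as_list = vector.tolist()
converted_type = type(as_list[0])     # float

print(element_type, converted_type)
```

The unpickler on the JVM side appears to handle the built-in types but not the NumPy ones, which is consistent with the PickleException quoted in the question.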

The only option is to use something like this:

udf(lambda x: create_vector(x).tolist(), ArrayType(FloatType()))
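
Spelled out with a named function, the same workaround might look like the sketch below. create_vector is not shown in the question, so the version here is a hypothetical placeholder, as are the helper names create_vector_as_list and build_vector_udf. The pyspark import sits inside the helper only so that the conversion logic can be tried without a Spark installation.

```python
import numpy as np


def create_vector(text):
    # Hypothetical placeholder: the real create_vector is not shown in the question.
    return np.array([float(len(word)) for word in text.split()], dtype=np.float32)


def create_vector_as_list(text):
    # Return built-in floats in a plain list, which ArrayType(FloatType()) accepts.
    if text is None:
        return None
    return create_vector(text).tolist()


def build_vector_udf():
    # Requires pyspark; imported here so the functions above work without it.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, FloatType

    return udf(create_vector_as_list, ArrayType(FloatType()))
```

With a DataFrame named dataframe that has a text column, dataframe.withColumn('vector', build_vector_udf()('text')) would then add the vector column. The None check is an addition for null cells and is not part of the original one-liner.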
