Spark DataFrame: adding a new column with random data

Question

I want to add a new column to a DataFrame whose values are either 0 or 1. I tried the 'randint' function:

from random import randint

df1 = df.withColumn('isVal',randint(0,1))

But I get the following error:

/spark/python/pyspark/sql/dataframe.py", line 1313, in withColumn
    assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column

How can I generate random values for the column, using either a custom function or randint?

Answer 1

I had a similar problem with integer values from 5 to 10. I used the rand() function from pyspark.sql.functions:

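from pyspark.sql import functions as F  # avoids shadowing Python's built-in round()
# rand() is uniform on [0, 1); scale it to [5, 10) and round to the nearest whole number
df1 = df.withColumn("random", F.round(F.rand() * (10 - 5) + 5, 0))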
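
One thing worth checking with the rounding expression above: rounding a uniform draw gives the two endpoints (5 and 10) about half the weight of the values in between. The following is a minimal plain-Python sketch of the same arithmetic, with no Spark involved. It assumes random.random() as a stand-in for rand(), and the floor-based mapping is an alternative that is not part of the original answer.

```python
import math
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 100_000
draws = [rng.random() for _ in range(n)]  # uniform on [0, 1), standing in for rand()

# Mapping from the answer: scale to [5, 10), then round to the nearest integer.
# (Python and Spark break exact .5 ties differently, but ties are vanishingly rare here.)
rounded = Counter(int(round(u * (10 - 5) + 5)) for u in draws)

# Alternative mapping (an assumption, not from the answer):
# scale to [0, 6), take the floor, then shift so the result lands on 5..10.
floored = Counter(math.floor(u * (10 - 5 + 1)) + 5 for u in draws)

for value in range(5, 11):
    print(value, rounded[value], floored[value])
```

If evenly distributed integers are what you need, the Spark counterpart of the second mapping would be something like floor(rand() * 6) + 5, using floor from pyspark.sql.functions.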

Answer 2

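You are using Python's built-in random module. randint(0, 1) is evaluated once, in Python, and returns a single constant integer rather than a column.

As the error message shows, withColumn expects a Column that represents an expression.

To build one from Spark's own functions: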

from pyspark.sql.functions import rand,when
df1 = df.withColumn('isVal', when(rand() > 0.5, 1).otherwise(0))
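
To make the cause of the original error concrete, here is a small plain-Python sketch that runs without Spark. The Column class and the with_column function are hypothetical stand-ins written for illustration only; they are not PySpark code, and they copy just the assertion quoted in the traceback.

```python
from random import randint


class Column:
    """Hypothetical stand-in for pyspark.sql.Column (illustration only)."""

    def __init__(self, expr):
        self.expr = expr


def with_column(name, col):
    """Hypothetical stand-in for DataFrame.withColumn, keeping only its type check."""
    assert isinstance(col, Column), "col should be Column"
    return name, col.expr


# Python evaluates randint(0, 1) first, so with_column receives a plain int.
value = randint(0, 1)
try:
    with_column("isVal", value)
    error_message = None
except AssertionError as exc:
    error_message = str(exc)

# Passing an expression object instead satisfies the check.
result = with_column("isVal", Column("rand() > 0.5"))

print(type(value).__name__, error_message, result)
```

The point is that the random number is drawn before withColumn ever runs, so even without the assertion every row would receive the same value. rand() avoids this because it describes a computation that Spark evaluates for each row.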