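I have a Spark DataFrame with multiple columns. I want to group the rows by one column and then find the mode of a second column for each group. With a pandas DataFrame, I would do something like this: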
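import numpy as np
import pandas as pd
import scipy.stats

# max_value and num_values are assumed to be defined earlier (not shown in the post)
rand_values = np.random.randint(max_value,
                                size=num_values).reshape((num_values // 2, 2))  # // keeps the shape an int on Python 3
rand_values = pd.DataFrame(rand_values, columns=['x', 'y'])
# Turn x into a binary group label: 1 if above max_value/2, else 0
rand_values['x'] = rand_values['x'] > max_value/2
rand_values['x'] = rand_values['x'].astype('int32')
print(rand_values)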
##    x  y
## 0  0  0
## 1  0  4
## 2  0  1
## 3  1  1
## 4  1  2
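def mode(series):
    # `series` is the sub-DataFrame for one value of x.
    # np.atleast_1d copes with both older SciPy (array result) and newer SciPy (scalar result).
    return np.atleast_1d(scipy.stats.mode(series['y'])[0])[0]

rand_values.groupby('x').apply(mode)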
## x
## 0    4
## 1    1
## dtype: int64
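To spell out what "mode of y per x group" means independently of pandas and SciPy, here is a minimal plain-Python sketch. The function name `grouped_mode` and the sample rows are hypothetical (they are not the data from the post), and ties are assumed to resolve to the smallest value, which is how `scipy.stats.mode` behaves.

```python
from collections import Counter, defaultdict


def grouped_mode(pairs):
    """Return {group_key: most frequent value} for an iterable of (key, value) pairs."""
    counts = defaultdict(Counter)
    for key, value in pairs:
        counts[key][value] += 1  # tally each value within its group

    modes = {}
    for key, counter in counts.items():
        # Highest count wins; on a tie, the smallest value is chosen.
        modes[key] = min(counter, key=lambda v: (-counter[v], v))
    return modes


# Hypothetical sample data as (x, y) pairs
rows = [(0, 0), (0, 4), (0, 4), (1, 1), (1, 2)]
result = grouped_mode(rows)
print(result)
```

The same two steps (count each value within its group, then keep the value with the highest count) are what any grouped-mode implementation has to express, whatever the framework.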
In PySpark, I am able to find the mode of a single column:
df = sql_context.createDataFrame(rand_values)
def mode_spark(df, column):
    # Group by column and count the number of occurrences
    # of …