Related Solutions (0)

PySpark: compute the row-wise maximum over a subset of columns and add it to an existing DataFrame

I want to compute the maximum over a subset of columns for each row and add it as a new column to an existing DataFrame.

I managed to do this, but in a very awkward way:

def add_colmax(df, subset_columns, colnm):
    '''
    Calculate the maximum of the selected "subset_columns" from dataframe df for each row;
    a new column containing the row-wise maximum is added to dataframe df.

    df: dataframe. It must contain subset_columns as a subset of its columns
    colnm: name of the new column containing the row-wise maximum of subset_columns
    subset_columns: the subset of columns from which the row-wise maximum is calculated
    '''
    import numpy as np
    from pyspark.sql.functions import monotonically_increasing_id
    from pyspark.sql import Row

    def get_max_row_with_None(row):
        # row-wise maximum of the selected values
        return float(np.max(row))

    df_subset = df.select(subset_columns)
    rdd = df_subset.rdd.map(get_max_row_with_None)
    df_rowsum …
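A more direct alternative (not part of the original question; a minimal sketch with hypothetical column names "a", "b", "c" and "row_max", assuming Spark 1.5+) is pyspark.sql.functions.greatest, which computes a row-wise maximum across columns without leaving the DataFrame API:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data for illustration
df = spark.createDataFrame([(1, 4, 3), (7, 2, 9), (5, 5, 1)], ["a", "b", "c"])

subset_columns = ["a", "b"]   # subset of columns to take the maximum over
colnm = "row_max"             # name of the new column

# greatest(*cols) returns the largest value per row across the given columns,
# skipping nulls; it requires at least two columns
df_with_max = df.withColumn(colnm, F.greatest(*[F.col(c) for c in subset_columns]))
df_with_max.show()

Unlike the RDD-based helper above, this stays within the DataFrame API, so there is no need to join the result back via monotonically_increasing_id.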

python apache-spark apache-spark-sql pyspark pyspark-sql

Score: 5 · 1 answer · 3598 views