I have the following PySpark DataFrame:
+------------------+------------------+--------------------+--------------+-------+
|              col1|              col2|                col3|             X|      Y|
+------------------+------------------+--------------------+--------------+-------+
|2.1729247374294496| 3.558069532647046|   6.607603368496324|             1|   null|
|0.2654841575294071|1.2633077949463256|0.023578679968183733|             0|   null|
|0.4253301781296708|3.4566490739823483| 0.11711202266039554|             3|   null|
| 2.608497168338446| 3.529397129549324|   0.373034222141551|             2|   null|
+------------------+------------------+--------------------+--------------+-------+
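This is a fairly simple operation that I could easily do with pandas. However, I need to do it using only PySpark.
I want to do the following (I'll write some pseudocode):
In the row where col3 == max(col3), change Y from null to 'K'
Among the remaining rows, in the row where col1 == max(col1), change Y from null to 'Z'
Among the remaining rows, in the row where col1 == min(col1), change Y from null to 'U'
In the remaining rows: change Y from null to 'I'.
So the expected output is: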
+------------------+------------------+--------------------+--------------+-------+
|              col1|              col2|                col3|             X|      Y|
+------------------+------------------+--------------------+--------------+-------+
|2.1729247374294496| 3.558069532647046|   6.607603368496324|             1|      K|
|0.2654841575294071|1.2633077949463256|0.023578679968183733|             0|      U|
|0.4253301781296708|3.4566490739823483| 0.11711202266039554|             3|      I|
| 2.608497168338446| 3.529397129549324|   0.373034222141551|             2|      Z|
+------------------+------------------+--------------------+--------------+-------+
Once that is done, I need to use this table as a lookup for another table:
+--------------------+--------+-----+------------------+--------------+------------+
|                  x1|      x2|   x3|                x4|             X|           d|
+--------------------+--------+-----+------------------+--------------+------------+
|0057f68a-6330-42a...|    2876| …