I have a DataFrame that looks like this:
+------+-------+
| key  | label |
+------+-------+
| key1 | a     |
| key1 | b     |
| key2 | a     |
| key2 | a     |
| key2 | a     |
+------+-------+
I want a modified version of countByKeys in Spark that returns output like this:
+------+-------+
| key  | count |
+------+-------+
| key1 | 0     |
| key2 | 3     |
+------+-------+
// Explanation:
// if all labels under a key are the same, the count is the number of rows under that key;
// otherwise the count for that key is 0
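To pin down the rule stated above, here is a minimal plain-Python sketch of the expected result. It is not Spark code. The function name `count_by_key_if_uniform` is hypothetical, and the rows are simply the sample data from the first table, written as (key, label) pairs.

```python
from collections import defaultdict


def count_by_key_if_uniform(rows):
    """Return {key: count} for an iterable of (key, label) pairs.

    The count is the number of rows for that key when every label
    under the key is identical, and 0 otherwise.
    """
    totals = defaultdict(int)   # rows seen per key
    labels = defaultdict(set)   # distinct labels seen per key
    for key, label in rows:
        totals[key] += 1
        labels[key].add(label)
    return {
        key: (totals[key] if len(labels[key]) == 1 else 0)
        for key in totals
    }


# Sample data from the question.
rows = [
    ("key1", "a"),
    ("key1", "b"),
    ("key2", "a"),
    ("key2", "a"),
    ("key2", "a"),
]

result = count_by_key_if_uniform(rows)
print(result)  # {'key1': 0, 'key2': 3}
```

The sketch only needs two quantities per key: the total number of rows and the number of distinct labels. Any Spark solution would have to compute the same two values for each key before applying the rule.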
My approach to solving this problem:
Steps: …