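I am trying to alias the columns obtained after pivoting on a value in a PySpark DataFrame. The problem is that the column names I put in the alias call are not set correctly.
A concrete example:
Starting from this DataFrame: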
import pyspark.sql.functions as func
df = sc.parallelize([
(217498, 100000001, 'A'), (217498, 100000025, 'A'), (217498, 100000124, 'A'),
(217498, 100000152, 'B'), (217498, 100000165, 'C'), (217498, 100000177, 'C'),
(217498, 100000182, 'A'), (217498, 100000197, 'B'), (217498, 100000210, 'B'),
(854123, 100000005, 'A'), (854123, 100000007, 'A')
]).toDF(["user_id", "timestamp", "actions"])
This gives:
+-------+--------------------+------------+
|user_id|     timestamp      |  actions   |
+-------+--------------------+------------+
| 217498|           100000001|    'A'     |
| 217498|           100000025|    'A'     |
| 217498|           100000124|    'A'     |
| 217498|           100000152|    'B'     |
| 217498|           100000165|    'C'     |
| …

I am currently trying to extract series of consecutive occurrences from a PySpark DataFrame and order/rank them, as shown below (for convenience, I have ordered the initial DataFrame by user_id and timestamp):
df_ini
+-------+--------------------+------------+
|user_id|     timestamp      |  actions   |
+-------+--------------------+------------+
| 217498|           100000001|    'A'     |
| 217498|           100000025|    'A'     |
| 217498|           100000124|    'A'     |
| 217498|           100000152|    'B'     |
| 217498|           100000165|    'C'     |
| 217498|           100000177|    'C'     |
| 217498|           100000182|    'A'     |
| 217498|           100000197|    'B'     |
| 217498|           100000210|    'B'     |
| 854123|           100000005|    'A'     |
| 854123|           100000007|    'A'     |
| etc.
to:
expected df_transformed
+-------+------------+------------+------------+
|user_id| actions | nb_of_occ | …
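
To make the intended transformation concrete, here is a minimal sketch in plain Python (not PySpark) that collapses consecutive identical actions per user into runs and counts them. It assumes the rows are already ordered by user_id and timestamp, as in df_ini. The expected table is cut off after nb_of_occ, so the last field, `run_order`, is only a hypothetical name for the rank of each run within a user; `consecutive_runs` is likewise an illustrative helper, not an existing API.

```python
from itertools import groupby

# Same rows as df_ini, already ordered by user_id and timestamp.
rows = [
    (217498, 100000001, 'A'), (217498, 100000025, 'A'), (217498, 100000124, 'A'),
    (217498, 100000152, 'B'), (217498, 100000165, 'C'), (217498, 100000177, 'C'),
    (217498, 100000182, 'A'), (217498, 100000197, 'B'), (217498, 100000210, 'B'),
    (854123, 100000005, 'A'), (854123, 100000007, 'A'),
]


def consecutive_runs(rows):
    """Return (user_id, action, nb_of_occ, run_order) for each run of equal actions.

    run_order is an assumed column name: the 1-based position of the run
    within its user. Input must be sorted by user_id, then timestamp.
    """
    result = []
    # groupby only merges adjacent items, which is exactly what a "run" is.
    for user_id, user_rows in groupby(rows, key=lambda r: r[0]):
        runs = groupby(user_rows, key=lambda r: r[2])
        for run_order, (action, run) in enumerate(runs, start=1):
            nb_of_occ = sum(1 for _ in run)  # length of this run
            result.append((user_id, action, nb_of_occ, run_order))
    return result


for row in consecutive_runs(rows):
    print(row)
```

In PySpark the same idea would probably be expressed with window functions partitioned by user_id and ordered by timestamp, but that is not shown here.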