cph*_*sto 4 row-number dataframe pandas apache-spark pyspark
I have a PySpark DataFrame:
valuesCol = [('Sweden',31),('Norway',62),('Iceland',13),('Finland',24),('Denmark',52)]
df = sqlContext.createDataFrame(valuesCol,['name','id'])
+-------+---+
| name| id|
+-------+---+
| Sweden| 31|
| Norway| 62|
|Iceland| 13|
|Finland| 24|
|Denmark| 52|
+-------+---+
I want to add a column to this DataFrame that holds the row number (serial number) of each row, as shown below.
My final output should be:
+-------+---+--------+
| name| id|row_num |
+-------+---+--------+
| Sweden| 31| 1|
| Norway| 62| 2|
|Iceland| 13| 3|
|Finland| 24| 4|
|Denmark| 52| 5|
+-------+---+--------+
My Spark version is 2.2.
I am trying this code, but it does not work:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
w = Window().orderBy()
df = df.withColumn("row_num", row_number().over(w))
df.show()
I get an error:
AnalysisException: 'Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table;'
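If I understand it correctly, I need to order by some column, but I do not want something like w = Window().orderBy('id'), because that would reorder the entire DataFrame.
Can anyone suggest how to achieve the output above using the row_number() function?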
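You have to specify a column for the order clause. If you do not need the values to be ordered, pass a dummy value instead. Try the following: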
from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window
w = Window().orderBy(lit('A'))
df = df.withColumn("row_num", row_number().over(w))