Bry*_*ind 1 row-number user-defined-functions dataframe pyspark
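I have a dataframe and I need to get the row number/index of a specific row. I would like to add a new column that combines the letter with the row number/index, e.g. "A - 1", "B - 2".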
#sample data
a= sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "distances"])
with the output
+------+---------+
|Letter|distances|
+------+---------+
|     A|       20|
|     B|       30|
|     D|       80|
+------+---------+
I would like the new output to look something like this:
+------+---------+-----+
|Letter|distances|index|
+------+---------+-----+
|     A|       20|A - 1|
|     B|       30|B - 2|
|     D|       80|D - 3|
+------+---------+-----+
This is the function I have been working on:
def cate(letter):
    return letter + " - " + #index
a.withColumn("index", cate(a["Letter"])).show()
Since you want to achieve the result using a UDF (only), let's try this:
from pyspark.sql.functions import udf, monotonically_increasing_id
from pyspark.sql.types import StringType
#sample data
a= sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "distances"])
def cate(letter, idx):
    return letter + " - " + str(idx)

cate_udf = udf(cate, StringType())

# monotonically_increasing_id() yields unique, increasing ids, but not consecutive ones
a = a.withColumn("temp_index", monotonically_increasing_id())
a = a.\
    withColumn("index", cate_udf(a.Letter, a.temp_index)).\
    drop("temp_index")
a.show()
Output is:
+------+---------+--------------+
|Letter|distances| index|
+------+---------+--------------+
|     A|       20|         A - 0|
|     B|       30|B - 8589934592|
|     D|       80|D - 8589934593|
+------+---------+--------------+
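The large values in this output follow from how monotonically_increasing_id() builds its ids: per the Spark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits, so the ids are unique and increasing but not consecutive. A minimal pure-Python sketch that decodes the three ids shown above (split_monotonic_id is a hypothetical helper name, not part of PySpark):

```python
def split_monotonic_id(mid):
    """Split an id into (partition_id, record_number_within_partition)."""
    partition_id = mid >> 33                 # upper 31 bits
    record_number = mid & ((1 << 33) - 1)    # lower 33 bits
    return partition_id, record_number


# The three ids that appear in the output above.
for mid in (0, 8589934592, 8589934593):
    print(mid, split_monotonic_id(mid))
# 0 (0, 0)
# 8589934592 (1, 0)
# 8589934593 (1, 1)
```

So row A is the first record of partition 0, while B and D are the first and second records of partition 1, which is why the suffixes are not 1, 2, 3.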
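If consecutive numbers starting at 1 are required, as in the question, one possible approach is to rank the rows with row_number() over a window ordered by the temporary id and pass that rank to the UDF. This is only a sketch under assumptions: add_letter_index is a hypothetical helper name, the column names come from the sample data, and the numbering follows the DataFrame's current physical order, which Spark does not guarantee to be stable without an explicit sort.

```python
def cate(letter, idx):
    # Plain-Python formatter, same logic as the UDF body above.
    return letter + " - " + str(idx)


def add_letter_index(df):
    """Append an "index" column such as "A - 1" to a Spark DataFrame."""
    # PySpark imports are kept inside the function so cate() can be
    # exercised on its own without a Spark installation.
    from pyspark.sql import Window
    from pyspark.sql.functions import (
        col, monotonically_increasing_id, row_number, udf,
    )
    from pyspark.sql.types import StringType

    cate_udf = udf(cate, StringType())
    # A window without partitionBy moves all rows into a single partition,
    # so this is only reasonable for small DataFrames.
    w = Window.orderBy("temp_index")
    return (
        df.withColumn("temp_index", monotonically_increasing_id())
          .withColumn("row_num", row_number().over(w))  # 1, 2, 3, ...
          .withColumn("index", cate_udf(col("Letter"), col("row_num")))
          .drop("temp_index", "row_num")
    )
```

Calling add_letter_index(a).show() on the sample DataFrame is expected to give A - 1, B - 2, D - 3, provided the rows are still in their original order.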