Sha*_*V C 7 scala apache-spark apache-spark-sql
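I have a DataFrame. I need to find the maximum length of the string values in a column and print both that value and its length.
I wrote the code below, but its output is only the maximum length, not the corresponding value. The question "How to get the max length of a string column from a DataFrame using Scala?" helped me arrive at the following query.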
df.agg(max(length(col("city")))).show()
Use the row_number() window function, ordered by length('city) desc.
Then keep only the row where row_number is 1, and add the length('city) column to the DataFrame.
Example:
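import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{length, row_number}
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val df = Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"))
  .toDF("city","num","country")
// No partitionBy: Spark moves all rows to a single partition to evaluate this window
val win = Window.orderBy(length('city).desc)
df.withColumn("str_len", length('city))
  .withColumn("rn", row_number().over(win))
  .filter('rn === 1)
  .show(false)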
+----+---+-------+-------+---+
|city|num|country|str_len|rn |
+----+---+-------+-------+---+
|ABC |1 |US |3 |1 |
+----+---+-------+-------+---+
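As an alternative to the window, the value and its length can also be fetched with a single aggregation. The following is only a sketch, assuming a local SparkSession; the object name `LongestCity` and the helper `longest` are hypothetical. It relies on Spark comparing structs field by field, so placing the length first makes `max` pick the longest string. On ties, the lexicographically greatest city among the longest ones is returned.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, length, max, struct}

object LongestCity {
  // Hypothetical helper: one row holding the longest city and its length.
  def longest(df: DataFrame): DataFrame =
    df
      // Structs are compared field by field, so str_len decides first, city breaks ties.
      .agg(max(struct(length(col("city")).as("str_len"), col("city"))).as("m"))
      .select(col("m.city"), col("m.str_len"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration only.
    val spark = SparkSession.builder().master("local[*]").appName("longest-city").getOrCreate()
    import spark.implicits._

    val df = Seq(("A", 1, "US"), ("AB", 1, "US"), ("ABC", 1, "US"))
      .toDF("city", "num", "country")

    longest(df).show(false) // a single row: city ABC with str_len 3

    spark.stop()
  }
}
```

Unlike the window version, this keeps only the aggregated columns; the other columns (num, country) would have to be added to the struct if they are needed.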
(Or)
In spark-sql:
df.createOrReplaceTempView("lpl")
spark.sql("select * from (select *, length(city) as str_len, row_number() over (order by length(city) desc) as rn from lpl) q where q.rn = 1")
  .show(false)
+----+---+-------+-------+---+
|city|num|country|str_len| rn|
+----+---+-------+-------+---+
| ABC| 1| US| 3| 1|
+----+---+-------+-------+---+
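If only one longest row is needed, the same result can be sketched in SQL without a window, by sorting on the length and taking the first row. This is an illustration under assumptions rather than part of the original answer: the object name `LongestCitySql` and the helper `longestCity` are hypothetical, and the view name `lpl` is reused from the example above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LongestCitySql {
  // Hypothetical helper: registers the view and returns the single longest city.
  def longestCity(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("lpl")
    // Sort by length descending and keep the first row only.
    spark.sql(
      "select city, length(city) as str_len from lpl order by length(city) desc limit 1")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration only.
    val spark = SparkSession.builder().master("local[*]").appName("longest-city-sql").getOrCreate()
    import spark.implicits._

    val df = Seq(("A", 1, "US"), ("AB", 1, "US"), ("ABC", 1, "US"))
      .toDF("city", "num", "country")

    longestCity(spark, df).show(false) // a single row: city ABC with str_len 3

    spark.stop()
  }
}
```

When several cities share the maximum length, which of them is returned is not specified by this query.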
Update:
Finding the minimum and maximum:
val win_desc=Window.orderBy(length('city).desc)
val win_asc=Window.orderBy(length('city).asc)
df.withColumn("str_len",length('city))
.withColumn("rn", row_number().over(win_desc))
.withColumn("rn1",row_number().over(win_asc))
.filter('rn===1 || 'rn1 === 1)
.show(false)
Result:
+----+---+-------+-------+---+---+
|city|num|country|str_len|rn |rn1|
+----+---+-------+-------+---+---+
|A |1 |US |1 |3 |1 | // shortest string
|ABC |1 |US |3 |1 |3 | // longest string
+----+---+-------+-------+---+---+
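Note that row_number() assigns distinct numbers even when several strings share the same length, so only one of the tied rows survives the filter. If every longest value should be kept, rank() can be used instead, since tied rows receive the same rank. A minimal sketch, assuming a local SparkSession; the object name `LongestCityTies`, the helper `allLongest`, and the sample data are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, length, rank}

object LongestCityTies {
  // Hypothetical helper: keeps every row whose city has the maximum length.
  def allLongest(df: DataFrame): DataFrame = {
    // No partitionBy, so all rows are evaluated in a single partition.
    val byLenDesc = Window.orderBy(length(col("city")).desc)
    df.withColumn("str_len", length(col("city")))
      .withColumn("rk", rank().over(byLenDesc)) // ties share rank 1
      .filter(col("rk") === 1)
      .drop("rk")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration only.
    val spark = SparkSession.builder().master("local[*]").appName("longest-city-ties").getOrCreate()
    import spark.implicits._

    val df = Seq("A", "ABC", "XYZ").toDF("city")

    allLongest(df).show(false) // both ABC and XYZ, each with str_len 3

    spark.stop()
  }
}
```

The same idea applies to the minimum: order the window by length ascending and keep rank 1.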