How to find the maximum string length of a column in Spark using a DataFrame?

Sha*_*V C 7 scala apache-spark apache-spark-sql

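I have a DataFrame. I need to compute the maximum length of the string values in a column and print both the value and its length.

I wrote the code below, but its output is only the maximum length, not the corresponding value. The question "How to get max length of string column from dataframe using scala?" did help me arrive at the following query.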

 import org.apache.spark.sql.functions.{col, length, max}
 df.agg(max(length(col("city")))).show()

Shu*_*Shu 6

Use the row_number() window function, ordered by length('city) descending.

Then keep only the row where row_number is 1, and add length('city) as a column to the DataFrame.

Example:

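import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{length, row_number}
import spark.implicits._ // enables toDF and the 'city column syntax

val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"))
       .toDF("city","num","country")

// Rank rows by string length, longest first. Without partitionBy,
// Spark moves all rows into a single partition to evaluate this window.
val win=Window.orderBy(length('city).desc)

df.withColumn("str_len",length('city))
  .withColumn("rn", row_number().over(win))
  .filter('rn===1) // keep only the longest row
  .show(false)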

+----+---+-------+-------+---+
|city|num|country|str_len|rn |
+----+---+-------+-------+---+
|ABC |1  |US     |3      |1  |
+----+---+-------+-------+---+

(or)

In Spark SQL:

df.createOrReplaceTempView("lpl")
// Same logic in SQL: number rows by length(city) descending, keep row 1.
spark.sql("""select * from (
  select *, length(city) as str_len, row_number() over (order by length(city) desc) as rn from lpl
) q where q.rn = 1""")
  .show(false)
+----+---+-------+-------+---+
|city|num|country|str_len|rn |
+----+---+-------+-------+---+
|ABC |1  |US     |3      |1  |
+----+---+-------+-------+---+
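One caveat with row_number(): if several rows share the maximum length, only one of them survives the filter, and which one is not guaranteed. A minimal sketch of a variant that keeps every tied row is shown here; it first computes the maximum length with agg and then filters on it. The local SparkSession, the app name, and the sample rows (including the tie between "ABC" and "XYZ") are assumptions for illustration, and the sketch assumes the column is non-empty and contains no nulls.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length, max}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("max-len-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data with two cities tied at the maximum length.
val df = Seq(("A", 1, "US"), ("ABC", 1, "US"), ("XYZ", 2, "IN"))
  .toDF("city", "num", "country")

// Step 1: compute the maximum length and bring it to the driver as an Int.
val maxLen: Int = df.agg(max(length(col("city")))).head().getInt(0)

// Step 2: keep every row whose length equals that maximum.
val longest = df
  .withColumn("str_len", length(col("city")))
  .filter(col("str_len") === maxLen)

longest.show(false)
```

This runs two Spark jobs instead of one, but it avoids the unpartitioned window and returns all rows of maximum length rather than an arbitrary one.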

Update:

To find both the minimum and the maximum:

// One window per direction: rn numbers the longest first, rn1 the shortest first.
val win_desc=Window.orderBy(length('city).desc)
val win_asc=Window.orderBy(length('city).asc)
df.withColumn("str_len",length('city))
  .withColumn("rn", row_number().over(win_desc))
  .withColumn("rn1",row_number().over(win_asc))
  .filter('rn===1 || 'rn1 === 1) // keep the longest and the shortest row
  .show(false)

Result:

+----+---+-------+-------+---+---+
|city|num|country|str_len|rn |rn1|
+----+---+-------+-------+---+---+
|A   |1  |US     |1      |3  |1  | //shortest string
|ABC |1  |US     |3      |1  |3  | //longest string
+----+---+-------+-------+---+---+
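If only the shortest and longest values and their lengths are needed, and not the full rows, a single aggregation may be enough. The sketch below is an alternative to the window approach, not the answer's method: it relies on Spark comparing structs field by field, so a struct with the length first lets min and max pick the shortest and longest city. The local SparkSession, the app name, and the output column names (min_city, min_len, max_city, max_len) are assumptions, and the column is assumed to contain no nulls.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length, max, min, struct}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("min-max-len-sketch").getOrCreate()
import spark.implicits._

// Same sample data as in the answer above.
val df = Seq(("A", 1, "US"), ("AB", 1, "US"), ("ABC", 1, "US"))
  .toDF("city", "num", "country")

// Structs are compared field by field, so with the length as the first field
// min/max select the shortest/longest city (ties are broken by the city string).
val lenAndCity = struct(length(col("city")).as("len"), col("city").as("city"))

val extremes = df
  .agg(min(lenAndCity).as("shortest"), max(lenAndCity).as("longest"))
  .select(
    col("shortest.city").as("min_city"), col("shortest.len").as("min_len"),
    col("longest.city").as("max_city"), col("longest.len").as("max_len"))

extremes.show(false)
```

For the sample data this yields one row: A with length 1 and ABC with length 3. Because it is an ordinary aggregation, it does not need Window.orderBy without a partition, which forces all rows into a single partition.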