How to add a column to a DataFrame based on the values of another column in Spark

Bha*_*esh 1 scala dataframe apache-spark apache-spark-sql

I have a DataFrame with a column "Age" of string type, and I want to get a new column that contains the matching range in string format.

The range boundaries are as follows:

[-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000]

For example, given the input values

Age
=====  
-1
12
18
28
38
46
======

Required output

  Age    Age-Range
 =====  ========= 
 -1     (-1,12)
 12     (-1,12)
 18     (12-17) 
 28     (24-34)
 38     (34-44)
 46     (44-54) 
======  ==========

Any suggestions or help would be highly appreciated.

Dan*_*ula 5

Here is a quick suggestion; I hope it helps:

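import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._ // needed for toDS/toDF; assumes a SparkSession named `spark` (already in scope in spark-shell)

// A half-open interval: lower bound inclusive, upper bound exclusive.
case class AgeRange(lowerBound: Int, upperBound: Int) {
  def contains(value: Int): Boolean = value >= lowerBound && value < upperBound
}

val rangeList = List(-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000)
// Pair consecutive boundaries: (-1, 12), (12, 17), (17, 24), ...
val ranges = rangeList.sliding(2).map(list => AgeRange(list(0), list(1))).toList
val dataset = Seq("-1", "12", "18", "28", "38", "46").toDS

// Returns the first range containing the value, or None if no range matches.
def findRange(value: Int, ageRanges: List[AgeRange]): Option[AgeRange] = ageRanges.find(_.contains(value))

// With UDF
def myUdf(ageRanges: List[AgeRange]) = udf {
  i: Int => findRange(i, ageRanges)
}

// The age column is a string, so cast it to int before applying the UDF.
val result1 = dataset.toDF("age").withColumn("age_range", myUdf(ranges)(col("age").cast("int")))

// With map
val result2 = dataset.map {
  i: String => (i, findRange(i.toInt, ranges))
}.toDF("age", "age_range")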

This results in:

result1: org.apache.spark.sql.DataFrame = [age: string, age_range: struct<lowerBound: int, upperBound: int>]
result2: org.apache.spark.sql.DataFrame = [age: string, age_range: struct<lowerBound: int, upperBound: int>]
+---+---------+
|age|age_range|
+---+---------+
| -1|  [-1,12]|
| 12|  [12,17]|
| 18|  [17,24]|
| 28|  [24,34]|
| 38|  [34,44]|
| 46|  [44,54]|
+---+---------+
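
The result above is a struct column with an inclusive lower bound, whereas the question asks for a string label and expects 12 to fall in (-1,12). Here is a minimal plain-Scala sketch of the lookup with the upper bound treated as inclusive instead. The object `AgeRangeLabels` and the helpers `rangeLabel` and `rangeLabelFromString` are hypothetical names introduced only for illustration, and the inclusive-upper-bound rule is an assumption inferred from the question's expected output. With these boundaries, 18 lands in (17,24).

```scala
import scala.util.Try

// Hypothetical helper object; names are illustrative and not part of any library.
object AgeRangeLabels {
  // Boundaries copied from the question.
  val bounds: List[Int] = List(-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000)

  // Consecutive pairs: (-1,12), (12,17), (17,24), ...
  val pairs: List[(Int, Int)] =
    bounds.sliding(2).map(w => (w(0), w(1))).toList

  // Assumption: the upper bound is inclusive, so 12 maps to "(-1,12)".
  // The lower bound of the first pair is also accepted so that -1 still matches.
  def rangeLabel(age: Int): Option[String] =
    pairs.zipWithIndex.collectFirst {
      case ((lo, hi), idx) if (age > lo || (idx == 0 && age == lo)) && age <= hi =>
        s"($lo,$hi)"
    }

  // The column holds strings, so parse defensively; unparsable input yields None.
  def rangeLabelFromString(raw: String): Option[String] =
    Try(raw.trim.toInt).toOption.flatMap(rangeLabel)
}
```

In Spark, a function like `rangeLabelFromString` could presumably be wrapped with `udf` and applied to the string column via `withColumn`, in the same way as `myUdf` above. Rows with no matching range would then come out as null.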