Bha*_*esh 1 scala dataframe apache-spark apache-spark-sql
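I have a DataFrame with a string-typed column "Age", and I want to derive a new column that holds the matching age range as a string.
The range boundaries are as follows: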
[-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000]
For example, given these input values:
Age
=====
-1
12
18
28
38
46
======
Required output:
Age Age-Range
===== =========
-1 (-1,12)
12 (-1,12)
18 (12-17)
28 (24-34)
38 (34-44)
46 (44-54)
====== ==========
Any suggestions or help would be highly appreciated.
Here is a quick suggestion; I hope it helps:
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

// A half-open interval: lowerBound is inclusive, upperBound is exclusive.
case class AgeRange(lowerBound: Int, upperBound: Int) {
  def contains(value: Int): Boolean = value >= lowerBound && value < upperBound
}

val rangeList = List(-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000)
// Pair up consecutive boundaries: (-1,12), (12,17), (17,24), ...
val ranges = rangeList.sliding(2).map(pair => AgeRange(pair(0), pair(1))).toList
val dataset = Seq("-1", "12", "18", "28", "38", "46").toDS

// Returns the first range containing the value, or None if no range matches.
def findRange(value: Int, ageRanges: List[AgeRange]): Option[AgeRange] =
  ageRanges.find(_.contains(value))

// With UDF
def myUdf(ageRanges: List[AgeRange]) = udf {
  i: Int => findRange(i, ageRanges)
}
val result1 = dataset.toDF("age").withColumn("age_range", myUdf(ranges)(col("age").cast("int")))

// With map
val result2 = dataset.map {
  i: String => (i, findRange(i.toInt, ranges))
}.toDF("age", "age_range")
This results in:
result1: org.apache.spark.sql.DataFrame = [age: string, age_range: struct<lowerBound: int, upperBound: int>]
result2: org.apache.spark.sql.DataFrame = [age: string, age_range: struct<lowerBound: int, upperBound: int>]
+---+---------+
|age|age_range|
+---+---------+
| -1| [-1,12]|
| 12| [12,17]|
| 18| [17,24]|
| 28| [24,34]|
| 38| [34,44]|
| 46| [44,54]|
+---+---------+
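Note that the result above stores each range as a struct, while the question asks for a string. A minimal sketch of one way to produce a string label follows. It is plain Scala with no Spark dependency, `ageRangeLabel` and `bounds` are hypothetical names that are not part of the original answer, and it assumes the same convention as `AgeRange.contains` (lower bound inclusive, upper bound exclusive).

```scala
// Hypothetical helper, not part of the original answer.
// Same boundaries as rangeList in the answer.
val bounds = List(-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000)

// Returns the label "(lower,upper)" of the first bucket [lower, upper)
// that contains `age`, or None when the age falls outside every bucket.
def ageRangeLabel(age: Int, bounds: List[Int]): Option[String] =
  bounds.sliding(2).collectFirst {
    case List(lower, upper) if age >= lower && age < upper => s"($lower,$upper)"
  }
```

In Spark this could presumably be wrapped the same way as `findRange`, for example `udf((age: Int) => ageRangeLabel(age, bounds))` applied to `col("age").cast("int")`, which would give a string column instead of a struct. With this convention a boundary value such as 12 lands in (12,17); swapping the comparisons to `age > lower && age <= upper` would put it in (-1,12) instead, but then -1 itself would match no bucket.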