Tags: scala, transform, hashmap, dataframe, apache-spark
I want to replace the values in a given DataFrame column using a hashmap, but I am struggling with the syntax. Can someone point me in the right direction or to an existing example? I have searched but could not find anything that addresses this exact problem.
Edit:
Imagine a DataFrame like the one shown below:
+-----------+--------+-----------+
|       Noun| Pronoun|  Adjective|
+-----------+--------+-----------+
|      Homer| Simpson|BeerDrinker|
|      Marge| Simpson|  Housewife|
|       Bart| Simpson|        Son|
|       Lisa| Simpson|   Daughter|
|TheSimpsons|Simpsons|     Family|
+-----------+--------+-----------+
I also have a map of key-value pairs, as shown below:
type ValueMap = scala.collection.mutable.HashMap[String, String]
val mymap = new ValueMap()
mymap += ("Simpson" -> "Surname")
I want to perform an operation (which I have not yet been able to figure out) that produces the result shown below. In the Pronoun column, every value equal to Simpson should be replaced by its corresponding value from the map mymap, which is Surname. Values with no matching key, such as Simpsons, should stay unchanged.
+-----------+--------+-----------+
|       Noun| Pronoun|  Adjective|
+-----------+--------+-----------+
|      Homer| Surname|BeerDrinker|
|      Marge| Surname|  Housewife|
|       Bart| Surname|        Son|
|       Lisa| Surname|   Daughter|
|TheSimpsons|Simpsons|     Family|
+-----------+--------+-----------+
Try this approach using a UDF:
val myMap = Map("Simpson" -> "Surname")
val df = Seq(
  ("Homer", "Simpson", "BeerDrinker"),
  ("Marge", "Simpson", "Housewife"),
  ("Bart", "Simpson", "Son"),
  ("Lisa", "Simpson", "Daughter"),
  ("TheSimpsons", "Simpsons", "Family")
).toDF("Noun", "Pronoun", "Adjective")
df.show(false)
+-----------+--------+-----------+
|Noun       |Pronoun |Adjective  |
+-----------+--------+-----------+
|Homer      |Simpson |BeerDrinker|
|Marge      |Simpson |Housewife  |
|Bart       |Simpson |Son        |
|Lisa       |Simpson |Daughter   |
|TheSimpsons|Simpsons|Family     |
+-----------+--------+-----------+
// Look up each value in the map; keep the original value when there is no matching key
val getVal = udf((x: String) => myMap.getOrElse(x, x))
val resDF = df.withColumn("Pronoun", getVal($"Pronoun"))
resDF.show(false)
+-----------+--------+-----------+
|Noun       |Pronoun |Adjective  |
+-----------+--------+-----------+
|Homer      |Surname |BeerDrinker|
|Marge      |Surname |Housewife  |
|Bart       |Surname |Son        |
|Lisa       |Surname |Daughter   |
|TheSimpsons|Simpsons|Family     |
+-----------+--------+-----------+
Let me know if this helps.
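The question declares a mutable HashMap, while the UDF above captures an immutable Map. A minimal sketch of one way to bridge the two, in plain Scala without Spark: take an immutable snapshot with toMap and do the same getOrElse lookup the UDF relies on. The names LookupSketch, lookup and replaceValue are hypothetical and only serve the illustration.

```scala
import scala.collection.mutable

object LookupSketch {
  // Mutable map, as declared in the question.
  val mymap: mutable.HashMap[String, String] = mutable.HashMap[String, String]()
  mymap += ("Simpson" -> "Surname")

  // Immutable snapshot; later changes to mymap do not affect it.
  val lookup: Map[String, String] = mymap.toMap

  // Same lookup the UDF performs: return the mapped value,
  // or the input itself when the key is absent.
  def replaceValue(x: String): String = lookup.getOrElse(x, x)
}
```

With this in place, the UDF body could call LookupSketch.replaceValue instead of referring to myMap directly; whether the snapshot is worth it depends on whether the map is modified elsewhere.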
Update:
Without a UDF, add the map to the DataFrame as another column:
val df1 = df.withColumn("map", typedLit(myMap))
val df2 = df1
  .withColumn(
    "Pronoun",
    when($"map"($"Pronoun").isNotNull, $"map"($"Pronoun")).otherwise($"Pronoun")
  )
  .drop("map")
df2.show(false)
+-----------+--------+-----------+
|Noun       |Pronoun |Adjective  |
+-----------+--------+-----------+
|Homer      |Surname |BeerDrinker|
|Marge      |Surname |Housewife  |
|Bart       |Surname |Son        |
|Lisa       |Surname |Daughter   |
|TheSimpsons|Simpsons|Family     |
+-----------+--------+-----------+
Another simple approach that avoids adding a new column:
val colMap = typedLit(myMap)
val df3 = df.withColumn(
  "Pronoun",
  when(colMap($"Pronoun").isNotNull, colMap($"Pronoun")).otherwise($"Pronoun")
)
df3.show(false)
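Two further variants may be worth considering; neither is part of the answer above. The first uses Spark's built-in DataFrameNaFunctions.replace, which performs exact-match replacement from a Map on a named column. The second keeps the typedLit map lookup but swaps the when/otherwise pair for coalesce, assuming that a lookup of a missing key yields null (the default, non-ANSI behaviour). This is a sketch, not a definitive implementation: the object name ReplaceSketch, the helper names and the local SparkSession are assumptions made so the example is self-contained; in spark-shell the session and implicits already exist.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, typedLit}

object ReplaceSketch {
  // Variant 1: built-in exact-match replacement on the Pronoun column.
  // Values with no key in `mapping` are left unchanged.
  def viaNaReplace(df: DataFrame, mapping: Map[String, String]): DataFrame =
    df.na.replace("Pronoun", mapping)

  // Variant 2: look the value up in a map literal; coalesce falls back to
  // the original value when the lookup returns null.
  def viaCoalesce(df: DataFrame, mapping: Map[String, String]): DataFrame = {
    val colMap = typedLit(mapping)
    df.withColumn("Pronoun", coalesce(colMap(col("Pronoun")), col("Pronoun")))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("replace-sketch")
      .getOrCreate()
    import spark.implicits._

    val myMap = Map("Simpson" -> "Surname")
    val df = Seq(
      ("Homer", "Simpson", "BeerDrinker"),
      ("TheSimpsons", "Simpsons", "Family")
    ).toDF("Noun", "Pronoun", "Adjective")

    viaNaReplace(df, myMap).show(false)
    viaCoalesce(df, myMap).show(false)

    spark.stop()
  }
}
```

Both variants stay within Spark's built-in expressions, so they avoid the serialization overhead of a UDF. na.replace is the shorter of the two when only exact matches on a single column are needed.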