Map each value of an array in a Spark Row

laz*_*wiz 3 scala dataframe apache-spark apache-spark-sql

I have a JSON dataset in the following format, one entry per line.

 { "sales_person_name" : "John", "products" : ["apple", "mango", "guava"]}
 { "sales_person_name" : "Tom", "products" : ["mango", "orange"]}
 { "sales_person_name" : "John", "products" : ["apple", "banana"]}
 { "sales_person_name" : "Steve", "products" : ["apple", "mango"]}
 { "sales_person_name" : "Tom", "products" : ["mango", "guava"]}

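I want to find out who sold the most mangoes, and so on. To do that, I want to load the file into a DataFrame and, for each product value in each transaction's array, emit a (key, value) pair of (product, name).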

var df = spark.read.json("s3n://sales-data.json")
df.printSchema()
root
 |-- sales_person_name: string (nullable = true)
 |-- products: array (nullable = true)

var nameProductsMap = df.select("sales_person_name",  "products").show()
+-----------------+--------------------+
|sales_person_name|   products         |
+-----------------+--------------------+
|             John|[mango, apple,...   |
|              Tom|[mango, orange,...  |
|             John|[apple, banana...   | 

var resultMap = df.select("products", "sales_person_name")
                  .map(r => (r(1), r(0)))
                  .show()  //This is where I am stuck.

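I can't figure out the right way to explode() row(0) and emit each of its values once, paired with the row(1) value. Can anyone suggest an approach? Thanks!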

Expected output:

Mango : John(4), Tom(2), Greg(1)... 
Banana: Tom(5), John(2), ...

Tza*_*har 5

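import scala.collection.mutable

// Emit one row per element of the "products" array into a new "product" column
val exploded = df.explode("products", "product") { a: mutable.WrappedArray[String] => a }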
val result = exploded.drop("products")
result.show()

Prints:

+-----------------+-------+
|sales_person_name|product|
+-----------------+-------+
|             John|  apple|
|             John|  mango|
|             John|  guava|
|              Tom|  mango|
|              Tom| orange|
|             John|  apple|
|             John| banana|
|            Steve|  apple|
|            Steve|  mango|
|              Tom|  mango|
|              Tom|  guava|
+-----------------+-------+
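
For what it's worth, `Dataset.explode` is deprecated from Spark 2.0 onward in favor of the `explode` function in `org.apache.spark.sql.functions`. Below is a minimal sketch of that variant, extended with a `groupBy`/`count` step to get the per-product tallies the question asks for. The object name `ProductSales`, the helper `salesPerProduct`, and the local in-memory session are assumptions made for illustration; they are not from the original post, and the sample rows simply mirror the JSON in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc, explode}

// Hypothetical wrapper object, for illustration only.
object ProductSales {

  // Produce one row per (product, sales_person_name) pair,
  // then count how many transactions each pair appears in.
  def salesPerProduct(df: DataFrame): DataFrame =
    df.select(explode(col("products")).as("product"), col("sales_person_name"))
      .groupBy("product", "sales_person_name")
      .count()
      .orderBy(col("product"), desc("count"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; the question reads from S3 instead.
    val spark = SparkSession.builder()
      .appName("product-sales-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same rows as the sample JSON in the question, built in memory.
    val df = Seq(
      ("John", Seq("apple", "mango", "guava")),
      ("Tom", Seq("mango", "orange")),
      ("John", Seq("apple", "banana")),
      ("Steve", Seq("apple", "mango")),
      ("Tom", Seq("mango", "guava"))
    ).toDF("sales_person_name", "products")

    salesPerProduct(df).show()
    spark.stop()
  }
}
```

Note that `explode` drops rows whose array is null or empty; `explode_outer` (Spark 2.2+) keeps them with a null product if that matters for your data.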