laz*_*wiz 3 scala dataframe apache-spark apache-spark-sql
I have a JSON dataset in the following format, with one entry per line.
{ "sales_person_name" : "John", "products" : ["apple", "mango", "guava"]}
{ "sales_person_name" : "Tom", "products" : ["mango", "orange"]}
{ "sales_person_name" : "John", "products" : ["apple", "banana"]}
{ "sales_person_name" : "Steve", "products" : ["apple", "mango"]}
{ "sales_person_name" : "Tom", "products" : ["mango", "guava"]}
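I want to find out who sold the most mangoes, and so on for every other product. To do that, I want to load the file into a DataFrame and, for each product value in each transaction's array, emit a (key, value) pair of (product, name).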
var df = spark.read.json("s3n://sales-data.json")
df.printSchema()
root
|-- sales_person_name: string (nullable = true)
|-- products: array (nullable = true)
var nameProductsMap = df.select("sales_person_name", "products").show()
+-----------------+--------------------+
|sales_person_name| products |
+-----------------+--------------------+
| John|[mango, apple,... |
| Tom|[mango, orange,... |
| John|[apple, banana... |
var resultMap = df.select("products", "sales_person_name")
.map(r => (r(1), r(0)))
.show() //This is where I am stuck.
I can't figure out the right way to explode() row(0) and emit each of its values once together with the row(1) value. Can anyone suggest an approach? Thanks!
Expected output:
Mango : John(4), Tom(2), Greg(1)...
Banana: Tom(5), John(2), ...
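import scala.collection.mutable
// Emit one row per element of the "products" array, in a new "product" column.
val exploded = df.explode("products", "product") { a: mutable.WrappedArray[String] => a }
// Drop the original array column, leaving (sales_person_name, product).
val result = exploded.drop("products")
result.show()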
Prints:
+-----------------+-------+
|sales_person_name|product|
+-----------------+-------+
| John| apple|
| John| mango|
| John| guava|
| Tom| mango|
| Tom| orange|
| John| apple|
| John| banana|
| Steve| apple|
| Steve| mango|
| Tom| mango|
| Tom| guava|
+-----------------+-------+
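The exploded rows above still need to be counted to reach the per-product totals asked for in the question. Below is a minimal sketch of one way to do that. It swaps in `org.apache.spark.sql.functions.explode` (the `Dataset.explode` method has been deprecated since Spark 2.0) and then uses `groupBy(...).count()`. The object name `ProductSales`, the helper names `explodeProducts` and `salesCounts`, the local `SparkSession`, and the in-memory copy of the sample records are all assumptions made for illustration; they are not part of the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper object, for illustration only.
object ProductSales {
  // One output row per array element: (sales_person_name, product).
  def explodeProducts(df: DataFrame): DataFrame =
    df.select(col("sales_person_name"), explode(col("products")).as("product"))

  // How many times each person sold each product,
  // sorted by product and then by descending count.
  def salesCounts(df: DataFrame): DataFrame =
    explodeProducts(df)
      .groupBy("product", "sales_person_name")
      .count()
      .orderBy(col("product"), col("count").desc)

  def main(args: Array[String]): Unit = {
    // Local session for the sketch; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .appName("product-sales-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The five sample records from the question, built in memory
    // instead of being read from S3.
    val df = Seq(
      ("John", Seq("apple", "mango", "guava")),
      ("Tom", Seq("mango", "orange")),
      ("John", Seq("apple", "banana")),
      ("Steve", Seq("apple", "mango")),
      ("Tom", Seq("mango", "guava"))
    ).toDF("sales_person_name", "products")

    salesCounts(df).show()
    spark.stop()
  }
}
```

With the five sample records, the `mango` group lists Tom first with a count of 2, followed by John and Steve with 1 each. The result is a long table of (product, sales_person_name, count) rather than the single-line `Mango : John(4), Tom(2)` layout, but it carries the same information.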