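I have a Spark DataFrame (df) with 2 columns (Report_id and Cluster_number).
I want to apply a function (getClusterInfo) to df that returns the info for each cluster. For example, if the cluster number is "3", then for a specific report_id the 3 rows below will be written: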
{"cluster_id":"1","influencers":[{"screenName":"A"},{"screenName":"B"},{"screenName":"C"},...]}
{"cluster_id":"2","influencers":[{"screenName":"D"},{"screenName":"E"},{"screenName":"F"},...]}
{"cluster_id":"3","influencers":[{"screenName":"G"},{"screenName":"H"},{"screenName":"E"},...]}
I am using foreach on df to apply the getClusterInfo function, but I can't figure out how to convert the output into a DataFrame (Report_id, Array[cluster_info]).
Here is the code snippet:
df.foreach(row => {
  val report_id = row(0)
  val cluster_no = row(1).toString
  val cluster_numbers = new Range(0, cluster_no.toInt - 1, 1)
  for (cluster <- cluster_numbers.by(1)) {
    val cluster_id = report_id + "_" + cluster
    // get cluster influencers
    val result = getClusterInfo(cluster_id) …
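
One possible direction, as a rough sketch rather than a definitive answer: foreach returns Unit and runs only for its side effects, so nothing computed inside it comes back as a DataFrame. Using map instead and returning a (Report_id, Seq[cluster_info]) tuple per row gives a Dataset that can be renamed with toDF. Everything below is an assumption for illustration: the getClusterInfo body is a hypothetical stand-in that returns a JSON string, the sample data is made up, and the loop uses 0 until clusterCount on the guess that one entry per cluster is intended (note that new Range(0, n - 1, 1) is end-exclusive and yields only n - 1 values).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ClusterInfoSketch {
  // Hypothetical stand-in for the real getClusterInfo; assumed to return one JSON string.
  def getClusterInfo(clusterId: String): String =
    s"""{"cluster_id":"$clusterId","influencers":[]}"""

  // Pure helper: builds one cluster_info entry per cluster of a report.
  // Assumes cluster ids have the form "<report_id>_<index>" with a 0-based index.
  def clusterInfos(reportId: String, clusterCount: Int): Seq[String] =
    (0 until clusterCount).map(i => getClusterInfo(s"${reportId}_$i"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cluster-info-sketch")
      .getOrCreate()
    import spark.implicits._ // encoders for tuples and Seq[String], plus toDF

    // Made-up sample data with the two columns from the question.
    val df = Seq(("r1", 3), ("r2", 1)).toDF("Report_id", "Cluster_number")

    // map (unlike foreach) returns a Dataset, here of (String, Seq[String]).
    val result: DataFrame = df
      .map { row =>
        val reportId = row.get(0).toString
        val clusterCount = row.get(1).toString.toInt
        (reportId, clusterInfos(reportId, clusterCount))
      }
      .toDF("Report_id", "cluster_info")

    result.printSchema() // cluster_info shows up as an array<string> column
    result.show(false)
    spark.stop()
  }
}
```

If the real getClusterInfo returns something other than a String (an Option, or a case class), the tuple's second element would need to change to match, and Spark would need an encoder for that type.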