Post by Mon*_*ika

Applying a function to each row of a Spark DataFrame

I have a Spark DataFrame (df) with 2 columns (Report_id, Cluster_number).

I want to apply a function (getClusterInfo) to df, which returns the name of each cluster. For example, if the cluster number is "3", then for a particular report_id, the 3 rows below should be written:

{"cluster_id":"1","influencers":[{"screenName":"A"},{"screenName":"B"},{"screenName":"C"},...]}
{"cluster_id":"2","influencers":[{"screenName":"D"},{"screenName":"E"},{"screenName":"F"},...]}
{"cluster_id":"3","influencers":[{"screenName":"G"},{"screenName":"H"},{"screenName":"E"},...]}

I used foreach on df to apply the getClusterInfo function, but I don't know how to convert the output into a DataFrame (Report_id, Array[cluster_info]).

Here is the code snippet:

  df.foreach(row => {
    val report_id = row(0)
    val cluster_no = row(1).toString
    val cluster_numbers = 0 until cluster_no.toInt - 1
    for (cluster <- cluster_numbers) {
      val cluster_id = report_id + "_" + cluster
      // get cluster influencers
      val result = getClusterInfo(cluster_id) …
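Since foreach only performs side effects and returns nothing, one way to get a DataFrame back is to map each row to the desired pair and call toDF. Below is a minimal sketch, not the asker's actual code: getClusterInfo is stubbed with a hypothetical JSON-building body, and the sample data is made up.

```scala
import org.apache.spark.sql.SparkSession

object ClusterInfoSketch {
  // Hypothetical stand-in for the asker's getClusterInfo, which is not shown in the post
  def getClusterInfo(clusterId: String): String =
    s"""{"cluster_id":"$clusterId","influencers":[]}"""

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Made-up sample data with the two columns described in the question
    val df = Seq(("r1", 3), ("r2", 2)).toDF("Report_id", "Cluster_number")

    // Map each row to (Report_id, Array[cluster_info]) instead of using foreach
    val result = df.as[(String, Int)].map { case (reportId, clusterNo) =>
      val infos = (0 until clusterNo).map(c => getClusterInfo(reportId + "_" + c)).toArray
      (reportId, infos)
    }.toDF("Report_id", "cluster_info")

    result.show(false)
    spark.stop()
  }
}
```

Because map on a Dataset returns a new Dataset, the per-row results stay on the executors and come back as a proper DataFrame, rather than being lost inside foreach.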

scala dataframe apache-spark apache-spark-sql

3 votes · 1 answer · 10k views
