如何在 Scala/Spark 数据框中的每一行中使用 withColumn 和条件

Ath*_*kur 1 scala apache-spark apache-spark-sql

我有以下格式的数据框

+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+
|DataPartition    |TimeStamp                |FFAction|!||IdentifierValue_effectiveFrom|IdentifierValue_effectiveTo|IdentifierValue_identifierEntityId|IdentifierValue_identifierEntityTypeId|IdentifierValue_identifierTypeId|
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+
|SelfSourcedPublic|2018-03-05T11:54:18+00:00|I|!|       |1900-01-01T00:00:00+00:00    |9999-12-31T00:00:00+00:00  |4295903126                        |404010                                |320150                          |
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+
Run Code Online (Sandbox Code Playgroud)

我想在下面的列中添加带有条件的额外列

IdentifierValue_identifierEntityTypeId
Run Code Online (Sandbox Code Playgroud)

添加具有以下条件的额外列分区

如果 IdentifierValue_identifierEntityTypeId =1001371402 那么分区 =Repno2FundamentalSeries 否则如果 IdentifierValue_identifierEntityTypeId404010 那么分区 = Repno2Organization

这就是我正在努力实现的目标

 val temp = temp1.withColumn("Partition", when($"IdentifierValue_identifierEntityTypeId" === "404010", 0).otherwise("Repno2FundamentalSeries"))
    temp.show(false)
Run Code Online (Sandbox Code Playgroud)

我的输出低于输出,但得到的值为零

IdentifierValue_identifierEntityTypeId
Run Code Online (Sandbox Code Playgroud)

我是 Scala 的新手,所以提出了基本问题

对于列上的多个条件如何写 when 和 else 。这对我不起作用

线程“main”中的异常 java.lang.IllegalArgumentException:otherwise() 只能在先前由 when() 生成的列上应用一次

val dataMain = dataMain1.withColumn(
      "Partition",
      when($"RelationObjectId_relatedObjectType" === "EDInstrument" && $"RelationObjectId_relatedObjectType" === "Fundamental", "Instrument2Fundamental")
        .otherwise(when($"RelationObjectId_relatedObjectType" === "EDInstrument" && $"RelationObjectId_relatedObjectType" === "FundamentalSeries", "Instrument2FundamentalSeries"))
        .otherwise(when($"RelationObjectId_relatedObjectType" === "Organization" && $"RelationObjectId_relatedObjectType" === "Fundamental", "Organization2Fundamental"))
        .otherwise(when($"RelationObjectId_relatedObjectType" === "Organization" && $"RelationObjectId_relatedObjectType" === "FundamentalSeries", "Organization2FundamentalSeries"))
        )
Run Code Online (Sandbox Code Playgroud)

Sha*_*ala 7

根据您提供的条件,您应该如下更改 when 条件。

如果 IdentifierValue_identifierEntityTypeId =1001371402 那么分区 =Repno2FundamentalSeries 否则如果 IdentifierValue_identifierEntityTypeId404010 那么分区 = Repno2Organization

df1.withColumn("Partition",
  when($"IdentifierValue_identifierEntityTypeId" === "1001371402", "Repno2FundamentalSeries")
    .otherwise("Repno2Organization")
)
Run Code Online (Sandbox Code Playgroud)

输出:

+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+-----------------------+
|DataPartition    |TimeStamp                |FFAction|!||IdentifierValue_effectiveFrom|IdentifierValue_effectiveTo|IdentifierValue_identifierEntityId|IdentifierValue_identifierEntityTypeId|IdentifierValue_identifierTypeId|Partition              |
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+-----------------------+
|SelfSourcedPublic|2018-03-05T11:54:18+00:00|I||!       |1900-01-01T00:00:00+00:00    |9999-12-31T00:00:00+00:00  |4295903126                        |404010                                |320150                          |Repno2FundamentalSeries|
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+-----------------------+
Run Code Online (Sandbox Code Playgroud)

编辑:

这是您编写嵌套的方式 When

val dataMain = df.withColumn(
"Partition",
when(($"RelationObjectId_relatedObjectType" === "EDInstrument" && $"RelationObjectId_relatedObjectType" === "Fundamental"), "Instrument2Fundamental")
  .otherwise(
    when($"RelationObjectId_relatedObjectType" === "EDInstrument" && $"RelationObjectId_relatedObjectType" === "FundamentalSeries", "Instrument2FundamentalSeries")
      .otherwise(
        when($"RelationObjectId_relatedObjectType" === "Organization" && $"RelationObjectId_relatedObjectType" === "Fundamental", "Organization2Fundamental")
          .otherwise(when($"RelationObjectId_relatedObjectType" === "Organization" && $"RelationObjectId_relatedObjectType" === "FundamentalSeries", "Organization2FundamentalSeries")
          )
      )
  )
Run Code Online (Sandbox Code Playgroud)

)

希望这可以帮助