如何使用Spark 2屏蔽列?

She*_*har 0 scala apache-spark apache-spark-sql apache-spark-2.0

我有一些表需要掩盖其某些列。每个表要屏蔽的列各不相同,我正在从application.conf文件中读取这些列。

例如,对于雇员表,如下所示

+----+------+-----+---------+
| id | name | age | address |
+----+------+-----+---------+
| 1  | abcd | 21  | India   |
+----+------+-----+---------+
| 2  | qazx | 42  | Germany |
+----+------+-----+---------+
Run Code Online (Sandbox Code Playgroud)

如果我们要屏蔽名称和年龄列,那么我将按顺序获取这些列。

val mask = Seq("name", "age")
Run Code Online (Sandbox Code Playgroud)

屏蔽后的期望值为:

+----+----------------+----------------+---------+
| id | name           | age            | address |
+----+----------------+----------------+---------+
| 1  | *** Masked *** | *** Masked *** | India   |
+----+----------------+----------------+---------+
| 2  | *** Masked *** | *** Masked *** | Germany |
+----+----------------+----------------+---------+
Run Code Online (Sandbox Code Playgroud)

如果我有雇员表一个数据框,那么屏蔽这些列的方法是什么?

如果我有payment如下所示的表,并且想要遮罩namesalary列,那么我会在Sequence中获得遮罩列

+----+------+--------+----------+
| id | name | salary | tax_code |
+----+------+--------+----------+
| 1  | abcd | 12345  | KT10     |
+----+------+--------+----------+
| 2  | qazx | 98765  | AD12d    |
+----+------+--------+----------+
Run Code Online (Sandbox Code Playgroud)
val mask = Seq("name", "salary")
Run Code Online (Sandbox Code Playgroud)

我尝试了类似的方法,mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***" ) ) )但没有返回任何结果。


感谢@philantrovert,我找到了解决方案。这是我使用的解决方案:

def maskData(base: DataFrame, maskColumns: Seq[String]) = {
    val maskExpr = base.columns.map { col => if(maskColumns.contains(col)) s"'*** Masked ***' as ${col}" else col }
    base.selectExpr(maskExpr: _*)
}
Run Code Online (Sandbox Code Playgroud)

Sha*_*ica 5

最简单,最快的方法是使用,withColumn并简单地用覆盖列中的值"*** Masked ***"。使用小型示例数据框

val df = spark.sparkContext.parallelize( Seq (
  (1, "abcd", 12345, "KT10" ),
  (2, "qazx", 98765, "AD12d")
)).toDF("id", "name", "salary", "tax_code")
Run Code Online (Sandbox Code Playgroud)

如果您要屏蔽的列数少且具有已知名称,则只需执行以下操作:

val mask = Seq("name", "salary")

df.withColumn("name", lit("*** Masked ***"))
  .withColumn("salary", lit("*** Masked ***"))
Run Code Online (Sandbox Code Playgroud)

否则,您需要创建一个循环:

var df2 = df
for (col <- mask){
  df2 = df2.withColumn(col, lit("*** Masked ***"))
}
Run Code Online (Sandbox Code Playgroud)

这两种方法都会给您这样的结果:

+---+--------------+--------------+--------+
| id|          name|        salary|tax_code|
+---+--------------+--------------+--------+
|  1|*** Masked ***|*** Masked ***|    KT10|
|  2|*** Masked ***|*** Masked ***|   AD12d|
+---+--------------+--------------+--------+
Run Code Online (Sandbox Code Playgroud)