替换Spark数据框中的列名称的特殊字符

PRI*_*A M 1 replace scala apache-spark-sql

我有我的输入spark-dataframe命名df为,

+---------------+----------------+-----------------------+
|Main_CustomerID|126+ Concentrate|2.5 Ethylhexyl_Acrylate|
+---------------+----------------+-----------------------+
|         725153|             3.0|                    2.0|
|         873008|             4.0|                    1.0|
|         625109|             1.0|                    0.0|
+---------------+----------------+-----------------------+
Run Code Online (Sandbox Code Playgroud)

我需要从以df下列名称中删除特殊字符,

  • 去掉 +

  • 替换为 underscore

  • 替换dotunderscore

所以我df应该像

+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|         725153|            3.0|                    2.0|
|         873008|            4.0|                    1.0|
|         625109|            1.0|                    0.0|
+---------------+---------------+-----------------------+
Run Code Online (Sandbox Code Playgroud)

使用Scala,我已经做到了,

var tableWithColumnsRenamed = df

for (field <- tableWithColumnsRenamed.columns) {
      tableWithColumnsRenamed = tableWithColumnsRenamed
        .withColumnRenamed(field, field.replaceAll("\\.", "_"))
    }
for (field <- tableWithColumnsRenamed.columns) {
      tableWithColumnsRenamed = tableWithColumnsRenamed
        .withColumnRenamed(field, field.replaceAll("\\+", ""))
    }
for (field <- tableWithColumnsRenamed.columns) {
      tableWithColumnsRenamed = tableWithColumnsRenamed
        .withColumnRenamed(field, field.replaceAll(" ", "_"))
    }

df = tableWithColumnsRenamed
Run Code Online (Sandbox Code Playgroud)

我用的时候

for (field <- tableWithColumnsRenamed.columns) {
      tableWithColumnsRenamed = tableWithColumnsRenamed
        .withColumnRenamed(field, field.replaceAll("\\.", "_"))
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
    }
Run Code Online (Sandbox Code Playgroud)

我得到了第一列名称,126 Concentrate而不是126_Concentrate

但是我不喜欢3 for这个替换循环。我可以得到解决方案吗?

Ram*_*jan 8

您可以使用withColumnRenamed regex replaceAllInfoldLeft如下

val columns = df.columns

val regex = """[+._, ]+"""
val replacingColumns = columns.map(regex.r.replaceAllIn(_, "_"))

val resultDF = replacingColumns.zip(columns).foldLeft(df){(tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)}

resultDF.show(false)
Run Code Online (Sandbox Code Playgroud)

这应该给你

+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|725153         |3.0            |2.0                    |
|873008         |4.0            |1.0                    |
|625109         |1.0            |0.0                    |
+---------------+---------------+-----------------------+
Run Code Online (Sandbox Code Playgroud)

我希望答案有帮助


Cha*_*Ray 6

df
  .columns
  .foldLeft(df){(newdf, colname) =>
    newdf.withColumnRenamed(colname, colname.replace(" ", "_").replace(".", "_"))
  }
  .show
Run Code Online (Sandbox Code Playgroud)