Create a substring column in a Spark DataFrame

Question


J S*_*ith 6 scala apache-spark spark-dataframe

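I want to take a JSON file and map it so that one of the columns is a substring of another. For example, take the table on the left and produce the table on the right: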

 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  hello  |


I can do this with spark-sql syntax, but how can it be done using the built-in functions?

Answer 1

pas*_*701 13

You can use a statement like this:

import org.apache.spark.sql.functions._


dataFrame.select(col("a"), substring_index(col("a"), ",", 1).as("b"))
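As a minimal sketch of the statement above, the following builds the one-row table from the question and applies `substring_index` with a count of 1, which keeps everything before the first occurrence of the delimiter. The local `SparkSession` setup and the `result` name are assumptions added for illustration; in `spark-shell` a `spark` session already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring_index}

// Assumption: a throwaway local session, used only for experimentation.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("substring-index-sketch")
  .getOrCreate()
import spark.implicits._

// The single-row table from the question.
val dataFrame = Seq("hello, world").toDF("a")

// count = 1: keep the text before the first "," in column a.
val result = dataFrame.select(col("a"), substring_index(col("a"), ",", 1).as("b"))
result.show()
```

If the delimiter does not occur in a value, `substring_index` returns the whole string unchanged.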

Answer 2

soo*_*ote 6

You would use the withColumn function:

import org.apache.spark.sql.functions.{ udf, col }
// Placeholder: replace ??? with your own substring logic.
def substringFn(str: String): String = ???
val substring = udf(substringFn _)
dataframe.withColumn("b", substring(col("a")))


UDFs are bad because, depending on what you do inside them, the query planner/optimizer may not be able to "see through" them. (2 upvotes)
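To make the trade-off in the comment concrete, here is a hedged sketch that fills in the UDF with one hypothetical piece of substring logic (the text before the first comma) and places it next to a built-in expression with the same result for this input. The body of `substringFn`, the local session, and the names `substringUdf`, `viaUdf` and `viaBuiltin` are assumptions for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring_index, udf}

// Assumption: a throwaway local session, used only for experimentation.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("udf-vs-builtin-sketch")
  .getOrCreate()
import spark.implicits._

val dataframe = Seq("hello, world").toDF("a")

// Hypothetical substring logic: everything before the first comma.
// The null check matters because a UDF receives null column values as-is.
def substringFn(str: String): String =
  if (str == null) null else str.takeWhile(_ != ',')
val substringUdf = udf(substringFn _)

// UDF version: the optimizer treats the function body as a black box.
val viaUdf = dataframe.withColumn("b", substringUdf(col("a")))

// Built-in version: the same result expressed as a native Spark expression.
val viaBuiltin = dataframe.withColumn("b", substring_index(col("a"), ",", 1))
```

Calling `explain()` on each DataFrame is one way to compare how the two versions appear in the physical plan.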

Answer 3

Ign*_*rre 6

Just to add to the existing answers: if you are interested in the right-hand part of the string column, that is:

 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  world  |


You should use a negative index:

dataFrame.select(col("a"), substring_index(col("a"), ",", -1).as("b"))
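One detail worth checking with this approach: with `","` as the delimiter, the part after the comma in `"hello, world"` still carries the leading space. The sketch below shows the raw result alongside a variant wrapped in `trim`. The local session and the column names `raw` and `b` are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring_index, trim}

// Assumption: a throwaway local session, used only for experimentation.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("substring-index-right-sketch")
  .getOrCreate()
import spark.implicits._

val dataFrame = Seq("hello, world").toDF("a")

// count = -1: keep the text after the last "," in column a.
val rightPart = dataFrame.select(
  col("a"),
  substring_index(col("a"), ",", -1).as("raw"), // keeps the space after the comma
  trim(substring_index(col("a"), ",", -1)).as("b") // surrounding spaces removed
)
rightPart.show()
```

Using `", "` as the delimiter is an alternative to `trim` when every value is known to have exactly one space after the comma.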


Answer 4

Anonymous 5

Suppose you have the following DataFrame:

import spark.implicits._
import org.apache.spark.sql.functions._

var df = sc.parallelize(Seq(("foobar", "foo"))).toDF("a", "b")

+------+---+
|     a|  b|
+------+---+
|foobar|foo|
+------+---+


You can derive a new column from a substring of the first column, as follows:

df = df.select(col("*"), substring(col("a"), 4, 6).as("c"))

+------+---+---+
|     a|  b|  c|
+------+---+---+
|foobar|foo|bar|
+------+---+---+
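Note that the arguments of `substring` are a 1-based start position and a length, not a start and end index. The call above asks for up to 6 characters starting at position 4, and returns `bar` only because the string ends first. A small sketch with an explicit length of 3, assuming a local session and the illustrative names `withC` and `head3`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

// Assumption: a throwaway local session, used only for experimentation.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("substring-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(("foobar", "foo")).toDF("a", "b")

val withC = df.select(
  col("*"),
  substring(col("a"), 4, 3).as("c"),    // 3 characters starting at position 4
  substring(col("a"), 1, 3).as("head3") // 3 characters starting at position 1
)
withC.show()
```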


Archived:	8 years, 7 months ago
Views:	24708
Last activity:	7 years ago