Updating a DataFrame column in Spark

Question

Luk*_*uke 64 python apache-spark apache-spark-sql pyspark spark-dataframe

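Looking at the new Spark DataFrame API, it is unclear whether it is possible to modify DataFrame columns.

How would I go about changing the value in row x, column y of a DataFrame?

In pandas this would be df.ix[x,y] = new_value

Edit: Consolidating what was said below, you can't modify the existing DataFrame because it is immutable, but you can return a new DataFrame with the desired modifications.

If you just want to replace a value in a column based on a condition, like np.where: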

from pyspark.sql import functions as F

update_func = (F.when(F.col('update_col') == replace_val, new_value)
                .otherwise(F.col('update_col')))
df = df.withColumn('new_column_name', update_func)

If you want to perform some operation on a column and create a new column that is added to the DataFrame:

import pyspark.sql.functions as F
import pyspark.sql.types as T

def my_func(value):
    # placeholder transformation: replace with the real per-value logic
    transformed_value = str(value)
    return transformed_value

# if we assume that my_func returns a string
my_udf = F.udf(my_func, T.StringType())

df = df.withColumn('new_column_name', my_udf('update_col'))

If you want the new column to have the same name as the old column, you can add an additional step:

df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col')

Answer 1

kar*_*son 63

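While you cannot modify a column in place, you can operate on a column and return a new DataFrame reflecting that change. To do so, first create a UserDefinedFunction implementing the operation to apply, then selectively apply that function to the targeted column only. In Python: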

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
new_df = old_df.select(*[udf(column).alias(name) if column == name else column for column in old_df.columns])

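new_df now has the same schema as old_df (assuming that old_df.target_column was of type StringType as well), but all values in column target_column will be new_value.

Also: `new_df = old_df.withColumn('target_column',udf(df.name))` (22 upvotes)
Yes, this should work fine. Keep in mind that UDFs can only take columns as parameters. If you want to pass other data into the function, you have to partially apply it first. (2 upvotes)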

Answer 2

Pau*_*aul 40

Commonly when updating a column, we want to map an old value to a new value. Here's a way to do that in pyspark without UDFs:

# update df[update_col], mapping old_value --> new_value
from pyspark.sql import functions as F
df = df.withColumn(update_col,
    F.when(df[update_col] == old_value, new_value)
     .otherwise(df[update_col]))

Answer 3

maa*_*asg 13

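DataFrames are based on RDDs. RDDs are immutable structures and do not allow updating elements in place. To change values, you need to create a new DataFrame by transforming the original one, either with the SQL-like DSL or with RDD operations such as map.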

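Highly recommended slide deck: Introducing DataFrames in Spark for Large Scale Data Science.

So what exactly does the DataFrame abstraction add that couldn't already be done in the same number of lines with a table? (3 upvotes)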

Answer 4

rad*_*1st 11

As maasg said, you can create a new DataFrame from the result of a map applied to the old DataFrame. An example for a given DataFrame df with two columns:

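import org.apache.spark.sql.Row
// SOMETHING and applySomeDef are placeholders for your own value and function
val newDf = sqlContext.createDataFrame(df.rdd.map(row =>
  Row(row.getInt(0) + SOMETHING, applySomeDef(row.getAs[Double]("y")))), df.schema)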

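Note that if the types of the columns change, you need to give it a correct schema instead of df.schema. Check out the API of org.apache.spark.sql.Row for available methods: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html

[Update] Or using UDFs in Scala: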

import org.apache.spark.sql.functions._

val toLong = udf[Long, String] (_.toLong)

val modifiedDf = df.withColumn("modifiedColumnName", toLong(df("columnName"))).drop("columnName")

If the column name needs to stay the same, you can rename it back:

modifiedDf.withColumnRenamed("modifiedColumnName", "columnName")

Answer 5

DHE*_*RAJ 5

Import col and when from pyspark.sql.functions, and update the fifth column from strings (string a, string b, string c) to integers (0, 1, 2) in a new DataFrame.

from pyspark.sql.functions import col, when

data_frame_temp = data_frame.withColumn("col_5", when(col("col_5") == "string a", 0).when(col("col_5") == "string b", 1).otherwise(2))

Asked:	11 years ago
Viewed:	100187 times
Last active:	8 years, 8 months ago