use*_*549 26 rename apache-spark apache-spark-sql pyspark
I want to rename two columns using Spark's withColumnRenamed function. Of course, I can write:
data = sqlContext.createDataFrame([(1,2), (3,4)], ['x1', 'x2'])
data = (data
    .withColumnRenamed('x1', 'x3')
    .withColumnRenamed('x2', 'x4'))
But I would like to do this in one step, passing a list/tuple of new names. Unfortunately, neither this:
data = data.withColumnRenamed(['x1', 'x2'], ['x3', 'x4'])
nor this:
data = data.withColumnRenamed(('x1', 'x2'), ('x3', 'x4'))
works. Is it possible to do this somehow?
zer*_*323 47
It is not possible with a single withColumnRenamed call.
You can use the DataFrame.toDF method*:
data.toDF('x3', 'x4')
or
new_names = ['x3', 'x4']
data.toDF(*new_names)
You can also rename with a simple select:
from pyspark.sql.functions import col
mapping = dict(zip(['x1', 'x2'], ['x3', 'x4']))
data.select([col(c).alias(mapping.get(c, c)) for c in data.columns])
Similarly, in Scala you can:
Rename all columns:
val newNames = Seq("x3", "x4")
data.toDF(newNames: _*)
Rename from a mapping with select:
val mapping = Map("x1" -> "x3", "x2" -> "x4")
data.select(
  data.columns.map(c => data(c).alias(mapping.getOrElse(c, c))): _*
)
or foldLeft + withColumnRenamed:
mapping.foldLeft(data) {
  case (data, (oldName, newName)) => data.withColumnRenamed(oldName, newName)
}
* Not to be confused with RDD.toDF, which is not a variadic function and takes the column names as a list.
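If you go the toDF route with a partial mapping, it may help to compute the full list of new names first. The following is only a sketch: renamed_columns is a hypothetical helper (not part of PySpark) that works on plain lists of names, leaves unmapped columns untouched, and refuses mappings that would produce duplicate names.

```python
def renamed_columns(columns, mapping):
    """Return the column names after applying an old -> new mapping, in order.

    `renamed_columns` is a hypothetical helper, not a PySpark API.
    Columns that are not keys of `mapping` keep their current name.
    """
    new_columns = [mapping.get(c, c) for c in columns]
    # Guard against two columns ending up with the same name.
    duplicates = sorted({c for c in new_columns if new_columns.count(c) > 1})
    if duplicates:
        raise ValueError(f"renaming would create duplicate columns: {duplicates}")
    return new_columns


# With a PySpark DataFrame named `data`, the result could be applied as:
#     data = data.toDF(*renamed_columns(data.columns, {"x1": "x3", "x2": "x4"}))
print(renamed_columns(["x1", "x2"], {"x1": "x3", "x2": "x4"}))  # ['x3', 'x4']
```

Because the helper only touches Python lists, it can be checked without a Spark session; the toDF call in the comment assumes `data` has exactly the columns passed in.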
Tus*_*lhe 15
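Why do you want to do it in a single line? If you print the execution plan, you will see that it is actually done in a single step anyway: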
data = spark.createDataFrame([(1,2), (3,4)], ['x1', 'x2'])
data = (data
    .withColumnRenamed('x1', 'x3')
    .withColumnRenamed('x2', 'x4'))
data.explain()
Output:
== Physical Plan ==
*(1) Project [x1#1548L AS x3#1552L, x2#1549L AS x4#1555L]
+- Scan ExistingRDD[x1#1548L,x2#1549L]
If you want to do it with a list of tuples, you can use a simple map function:
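from pyspark.sql import functions as F

data = spark.createDataFrame([(1,2), (3,4)], ['x1', 'x2'])
new_names = [("x1","x3"),("x2","x4")]
# Alias each old column to its new name in a single projection
data = data.select(list(
    map(lambda old, new: F.col(old).alias(new), *zip(*new_names))
))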
data.explain()
It still has the same plan.
Output:
== Physical Plan ==
*(1) Project [x1#1650L AS x3#1654L, x2#1651L AS x4#1655L]
+- Scan ExistingRDD[x1#1650L,x2#1651L]
Dan*_*cio 10
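As of PySpark 3.4.0, you can use the withColumnsRenamed() method to rename multiple columns at once. It takes as input a mapping from the existing column names to the desired new names.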
df = df.withColumnsRenamed({
    "x1": "x3",
    "x2": "x4"
})
This method renames both columns at the same time. Note that if a column (e.g. "x1") does not exist in the current DataFrame schema, no error is raised; the entry is simply ignored.
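Since missing columns are skipped silently, a typo in the mapping can go unnoticed. One way to catch that is to validate the mapping against the column list first. This is a minimal sketch: check_rename_mapping is a hypothetical helper, not a PySpark function, and it compares names case-sensitively, which may be stricter than how Spark resolves column names in your configuration.

```python
def check_rename_mapping(columns, mapping):
    """Return `mapping` unchanged if every key is an existing column name.

    `check_rename_mapping` is a hypothetical helper, not part of PySpark.
    The comparison is an exact, case-sensitive string match (an assumption).
    """
    missing = [old for old in mapping if old not in columns]
    if missing:
        raise KeyError(f"columns not found in schema: {missing}")
    return mapping


# With a PySpark (3.4.0+) DataFrame named `df`, it could be used as:
#     df = df.withColumnsRenamed(
#         check_rename_mapping(df.columns, {"x1": "x3", "x2": "x4"})
#     )
checked = check_rename_mapping(["x1", "x2"], {"x1": "x3", "x2": "x4"})
```

The helper returns the mapping itself so that it can be wrapped directly around the argument of withColumnsRenamed.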
Anonymous 9
You can also use a dictionary to iterate over the columns you want to rename.
Example:
# Here the keys are the new names and the values are the existing names
a_dict = {'sum_gb': 'sum_mbUsed', 'number_call': 'sum_call_date'}
for new_name, old_name in a_dict.items():
    df = df.withColumnRenamed(old_name, new_name)
I couldn't find a simple PySpark solution either, so I built my own, similar to pandas' df.rename(columns={'old_name_1':'new_name_1', 'old_name_2':'new_name_2'}).
def rename_columns(df, columns):
    if isinstance(columns, dict):
        # Apply each old -> new rename in turn
        for old_name, new_name in columns.items():
            df = df.withColumnRenamed(old_name, new_name)
        return df
    else:
        raise ValueError("'columns' should be a dict, like {'old_name_1':'new_name_1', 'old_name_2':'new_name_2'}")
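So your solution would look like data = rename_columns(data, {'x1': 'x3', 'x2': 'x4'})
It saves me a few lines of code; I hope it helps you too.
If you want to rename multiple columns by keeping the same names and adding a prefix (PREFIX is a string you define), this should work: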
from pyspark.sql import functions as f
df.select([f.col(c).alias(PREFIX + c) for c in df.columns])
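In practice you often want to keep some columns (join keys, for example) without the prefix. A possible variation, sketched here with a hypothetical helper named prefixed_names that is not part of PySpark, is to build the list of new names in plain Python and pass it to toDF.

```python
def prefixed_names(columns, prefix, skip=()):
    """Prepend `prefix` to every name in `columns` except those listed in `skip`.

    `prefixed_names` is a hypothetical helper, not a PySpark API.
    """
    skipped = set(skip)
    return [c if c in skipped else prefix + c for c in columns]


# With a PySpark DataFrame named `df`, the result could be applied as:
#     df = df.toDF(*prefixed_names(df.columns, "left_", skip=["id"]))
print(prefixed_names(["id", "x1", "x2"], "left_", skip=["id"]))  # ['id', 'left_x1', 'left_x2']
```

The "left_" prefix and the "id" column are illustrative values only.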
I have this hack in all of my PySpark programs:
import pyspark

def rename_sdf(df, mapper=None, **kwargs_mapper):
    ''' Rename column names of a dataframe
        mapper: a dict mapping from the old column names to new names
        Usage:
            df.rename({'old_col_name': 'new_col_name', 'old_col_name2': 'new_col_name2'})
            df.rename(old_col_name='new_col_name')
    '''
    # Avoid a mutable default argument: treat a missing mapper as an empty dict
    for before, after in (mapper or {}).items():
        df = df.withColumnRenamed(before, after)
    for before, after in kwargs_mapper.items():
        df = df.withColumnRenamed(before, after)
    return df

# Monkey-patch the helper onto DataFrame as a `rename` method
pyspark.sql.dataframe.DataFrame.rename = rename_sdf
Now you can easily rename the columns of any Spark DataFrame the pandas way!
df.rename({'old1':'new1', 'old2':'new2'})