I have a pyspark DataFrame with many columns, and I need to cast the string columns to their proper types, for example:

This is what I'm currently doing:
df = df.withColumn(col_name, col(col_name).cast('float')) \
       .withColumn(col_id, col(col_id).cast('int')) \
       .withColumn(col_city, col(col_city).cast('string')) \
       .withColumn(col_date, col(col_date).cast('date')) \
       .withColumn(col_code, col(col_code).cast('bigint'))
Is it possible to build a list of the types and apply it to all the columns at once?
You just need some mapping as a dictionary or similar, and then generate the proper select statement (you could use withColumn instead, but it often leads to performance problems). Something like this:
import pyspark.sql.functions as F

mapping = {'col1': 'float', ...}  # column name -> target type
df = ...  # your input data

# columns not in the mapping keep their current type
rest_cols = [F.col(cl) for cl in df.columns if cl not in mapping]
# columns in the mapping are cast, keeping their original names
conv_cols = [F.col(cl_name).cast(cl_type).alias(cl_name)
             for cl_name, cl_type in mapping.items()
             if cl_name in df.columns]
df = df.select(*rest_cols, *conv_cols)
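The same mapping logic can be sanity-checked without a Spark session by building the equivalent SQL `CAST` expressions as plain strings (the column names here are hypothetical stand-ins) and later feeding them to `df.selectExpr`:

```python
# Sketch only: builds CAST expression strings from a type mapping,
# assuming hypothetical column names in place of df.columns.
mapping = {"col_id": "int", "col_name": "float", "col_date": "date"}
columns = ["col_id", "col_name", "col_city", "col_date"]  # stand-in for df.columns

# columns not in the mapping pass through unchanged
rest_cols = [c for c in columns if c not in mapping]
# mapped columns get a CAST, aliased back to their original name
conv_cols = [f"CAST({c} AS {t}) AS {c}"
             for c, t in mapping.items() if c in columns]
select_expr = rest_cols + conv_cols
# in Spark this would be: df = df.selectExpr(*select_expr)
```

The `if c in columns` guard mirrors the answer's `if cl_name in df.columns` check, so entries in the mapping that don't exist in the DataFrame are silently skipped rather than raising an error.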