Joe*_*Joe 3 schema dataframe apache-spark pyspark
I have DF1 with this schema:
df1 = spark.read.parquet(load_path1)
df1.printSchema()
root
|-- PRODUCT_OFFERING_ID: string (nullable = true)
|-- CREATED_BY: string (nullable = true)
|-- CREATION_DATE: string (nullable = true)
And DF2:
df2 = spark.read.parquet(load_path2)
df2.printSchema()
root
|-- PRODUCT_OFFERING_ID: decimal(38,10) (nullable = true)
|-- CREATED_BY: decimal(38,10) (nullable = true)
|-- CREATION_DATE: timestamp (nullable = true)
Now I want to union these two DataFrames.
When I try to union them, it sometimes fails because the schemas differ.
How can I make DF2 have exactly the same schema as DF1 (at load time)?
I tried:
df2 = spark.read.parquet(load_path2).schema(df1.schema)
which raises an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'StructType' object is not callable
Or should I CAST the columns instead (once DF2 is read)?
Thanks.
Shu*_*Shu 10
Move .schema() before .parquet(), and Spark will read the parquet files with the specified schema:
df2 = spark.read.schema(df1.schema).parquet(load_path2)