PySpark - Defining a custom schema for a DataFrame

Question

I am trying to read a CSV file and store it in a DataFrame, but when I try to make the ID column a StringType, it does not come out the way I expect.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

table_schema = StructType([StructField('ID', StringType(), True),
                     StructField('Name', StringType(), True),
                     StructField('Tax_Percentage(%)', IntegerType(), False),
                     StructField('Effective_From', TimestampType(), False),
                     StructField('Effective_Upto', TimestampType(), True)])

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","


df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .option("schema", table_schema) \
  .load(file_location)



display(df)

Here is the schema produced after running the code above:

df:pyspark.sql.dataframe.DataFrame
ID:integer
Name:string
Tax_Percentage(%):integer
Effective_From:string
Effective_Upto:string

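Even though I supply the custom schema, ID is typed as integer where I expect a string. The same happens with the columns Effective_From and Effective_Upto, which come back as strings instead of timestamps.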

Answer 1

Dan*_*iel 5

It should be

.schema(table_schema) \

instead of

.option("schema", table_schema) \

Also, if you provide a schema definition, the .option("inferSchema", "true") \ line is not needed :)
