Tags: apache-spark, apache-spark-dataset
I want to create a Spark Dataset from a simple CSV file. Here are the contents of the CSV file:
name,state,number_of_people,coolness_index
trenton,nj,"10","4.5"
bedford,ny,"20","3.3"
patterson,nj,"30","2.2"
camden,nj,"40","8.8"
Here is the code that creates the Dataset:
val location = "s3a://path_to_csv"

case class City(name: String, state: String, number_of_people: Long)

val cities = spark.read
  .option("header", "true")
  .option("charset", "UTF8")
  .option("delimiter", ",")
  .csv(location)
  .as[City]
Here is the error message: "Cannot up cast `number_of_people` from string to bigint as it may truncate"
Databricks discusses creating Datasets, and this particular error message, in this blog post:
Encoders eagerly check that your data matches the expected schema, providing helpful error messages before you attempt to incorrectly process TBs of data. For example, if we try to use a datatype that is too small, such that conversion to an object would result in truncation (i.e. numStudents is larger than a byte, which holds a maximum value of 255), the Analyzer will emit an AnalysisException.
I'm using the Long type, so I didn't expect to see this error message.
Answer (19 votes):
Use schema inference:
val cities = spark.read
  .option("inferSchema", "true")
  ...
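Filled out, a minimal sketch of this first option (assuming the spark session, location, and City case class from the question; with inferSchema, Spark types number_of_people as an integer, which upcasts safely to Long):

import spark.implicits._

// inferSchema samples the parsed values (quotes are stripped during CSV
// parsing), so number_of_people comes back numeric instead of string
val cities = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(location)
  .as[City]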
Or provide a schema:
val cities = spark.read
  .schema(StructType(Array(StructField("name", StringType), ...)))
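Spelled out for the file in the question, a sketch of the full schema (column names come from the CSV above; DoubleType for coolness_index is my reading of the sample values):

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType, DoubleType}
import spark.implicits._

// Explicit schema: no inference pass over the data, and the column types
// already match what the City case class expects
val citySchema = StructType(Array(
  StructField("name", StringType),
  StructField("state", StringType),
  StructField("number_of_people", LongType),
  StructField("coolness_index", DoubleType)
))

val cities = spark.read
  .option("header", "true")
  .schema(citySchema)
  .csv(location)
  .as[City]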
Or cast the column:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

val cities = spark.read
  .option("header", "true")
  .csv(location)
  .withColumn("number_of_people", col("number_of_people").cast(LongType))
  .as[City]
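Why the cast is needed: with only the header option, spark.read.csv reads every column as StringType, so the encoder's eager check rejects the string-to-bigint upcast as potentially lossy even though the case class declares a Long. A quick way to confirm the default schema:

spark.read.option("header", "true").csv(location).printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- state: string (nullable = true)
//  |-- number_of_people: string (nullable = true)
//  |-- coolness_index: string (nullable = true)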