Specifying column types in sparklyr (spark_read_csv)

Lev*_*man 4 r sparklyr

I am using sparklyr to read a CSV file into Spark:

schema <- structType(structField("TransTime", "array<timestamp>", TRUE),
                 structField("TransDay", "Date", TRUE))

 spark_read_csv(sc, filename, "path", infer_schema = FALSE, schema = schema)

But I get:

Error: could not find function "structType"

How do I specify column types with spark_read_csv?

Thanks in advance.

Jad*_*ins 7

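The structType function belongs to Spark's own API (Scala's StructType, exposed in SparkR as structType), not to sparklyr. In sparklyr, you specify data types by passing them as a named list in the `columns` argument. Suppose we have the following CSV (data.csv):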

name,birthdate,age,height
jader,1994-10-31,22,1.79
maria,1900-03-12,117,1.32

The call to read this data is:

mycsv <- spark_read_csv(sc, "mydate",
                        path = "data.csv",
                        memory = TRUE,
                        infer_schema = FALSE,  # important: turn off inference so the types below are used
                        columns = list(
                          name = "character",
                          birthdate = "date",  # or "character", then convert with Spark date functions
                          age = "integer",
                          height = "double"))
# integer = "INTEGER"
# double = "REAL"
# character = "STRING"
# logical = "INTEGER"
# list = "BLOB"
# date = character = "STRING" # not sure
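
Since the mapping above is partly a guess, one way to check what Spark actually assigned is to read the file and inspect the schema. The following is only a sketch: `read_people` and the table name "people" are made-up names for illustration, and it assumes data.csv is reachable from the Spark session. It relies on `sdf_schema()` from sparklyr to report the resulting column types.

```r
# Column types as a named character vector; the names must match the CSV header.
col_types <- c(
  name      = "character",
  birthdate = "date",
  age       = "integer",
  height    = "double"
)

# Hypothetical helper: read the CSV with explicit types and return the
# Spark type that each column ended up with.
read_people <- function(sc, path = "data.csv") {
  tbl <- sparklyr::spark_read_csv(
    sc,
    name = "people",        # illustrative table name
    path = path,
    header = TRUE,
    infer_schema = FALSE,   # do not let Spark guess the types
    columns = col_types
  )
  schema <- sparklyr::sdf_schema(tbl)
  vapply(schema, function(field) field$type, character(1))
}

# Usage (needs a running Spark connection):
# sc <- sparklyr::spark_connect(master = "local")
# read_people(sc)
```

If a column does not come back with the type you asked for, reading it as "character" and converting afterwards is the fallback.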

To manipulate date columns you must use the Hive date functions, not R functions:

mycsv %>% mutate(birthyear = year(birthdate))

Reference: https://spark.rstudio.com/articles/guides-dplyr.html#hive-functions
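
If birthdate was read as "character" instead, it can be converted on the Spark side before using the date functions. A rough sketch, assuming the strings are in yyyy-MM-dd form as in data.csv; `add_birth_fields` is a made-up helper name, while `to_date()`, `year()` and `month()` are not R functions here but names that dplyr passes through to Spark SQL.

```r
# Hypothetical helper: takes a Spark table (tbl_spark) with a string or date
# `birthdate` column and adds year/month columns computed inside Spark.
add_birth_fields <- function(tbl) {
  # Step 1: convert the column to a Spark date (passed through as TO_DATE).
  with_date <- dplyr::mutate(tbl, birthdate = to_date(birthdate))
  # Step 2: derive fields with Spark SQL date functions, not R date functions.
  dplyr::mutate(
    with_date,
    birthyear  = year(birthdate),
    birthmonth = month(birthdate)
  )
}

# Usage (needs the Spark table from the answer above):
# add_birth_fields(mycsv)
```

The two mutate() calls are kept separate so that the year and month are clearly computed from the converted column rather than from the original string.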


小智 3

The free online book about sparklyr has an example of how to do this, with the details explained: https://therinspark.com/data.html

That said, the named-list example in Jader Martins's answer is simpler.

  • 404 - broken link (5 upvotes)