Posts by Big*_*bie

Splitting a string in a DataFrame using Scala on Spark

I have a log file with more than 100 columns. I only need two of them, "_raw" and "_time", so I loaded the log file as a CSV DataFrame.

Step 1:

scala> val log = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("soa_prod_diag_10_jan.csv")
log: org.apache.spark.sql.DataFrame = [ARRAffinity: string, CoordinatorNonSecureURL: string ... 126 more fields]

Step 2: I registered the DataFrame as a temporary view: log.createOrReplaceTempView("logs")
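The temporary view is only needed when the query is written in SQL. As a rough sketch of an alternative, the two columns can also be projected directly with the DataFrame API. The in-memory `log` below is a hypothetical stand-in for the DataFrame from Step 1 (its row values are made up), so that the snippet runs on its own; in spark-shell the `spark` session already exists and the builder line can be dropped.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("select-columns").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the `log` DataFrame loaded in Step 1 (values are placeholders).
val log = Seq(
  ("some raw log line", "some time value", "other column")
).toDF("_raw", "_time", "ARRAffinity")

// Keep only the two required columns; no temp view is involved.
val twoCols = log.select("_raw", "_time")
twoCols.show(1, false)
```

Both routes produce a DataFrame with the same two columns, so the choice is mostly a matter of whether the rest of the job is written in SQL or in DataFrame calls.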

Step 3: I extracted the two required columns, "_raw" and "_time":

scala> val sqlDF = spark.sql("select _raw, _time from logs")
sqlDF: org.apache.spark.sql.DataFrame = [_raw: string, _time: string]

scala> sqlDF.show(1, false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|_raw                                                                                                                                                                                                                                                                                                                                                                                                |_time|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|[2019-01-10T23:59:59.998-06:00] [xx_yyy_zz_sss_ra10] [ERROR] [OSB-473003] [oracle.osb.statistics.statistics] [tid: [ACTIVE].ExecuteThread: '28' for queue: 'weblogic.kernel.Default (self-tuning)'] [userId: <anonymous>] [ecid: 92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b,0] [partition-name: DOMAIN] [tenant-name: GLOBAL] Aggregation Server Not Available. Failed …
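The rest of the question is cut off here, so it is not known which parts of "_raw" were meant to be split out. As a minimal sketch, assuming the goal is to pull the leading bracketed fields into their own columns, `regexp_extract` from `org.apache.spark.sql.functions` can be applied to "_raw". The column names `ts`, `component`, `level`, and `msg_id` are hypothetical, and the sample row is a shortened copy of the start of the "_raw" value shown above with the middle fields left out.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract}

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("split-raw").getOrCreate()
import spark.implicits._

// Shortened stand-in for one `_raw` value (middle bracketed fields omitted).
val rawDF = Seq(
  "[2019-01-10T23:59:59.998-06:00] [xx_yyy_zz_sss_ra10] [ERROR] [OSB-473003] " +
    "[oracle.osb.statistics.statistics] Aggregation Server Not Available."
).toDF("_raw")

// Matches the first four "[...]" fields at the start of the line, one capture group each.
val head = """^\[([^\]]*)\] \[([^\]]*)\] \[([^\]]*)\] \[([^\]]*)\]"""

// Hypothetical column names; regexp_extract yields "" when the pattern does not match.
val parsed = rawDF
  .withColumn("ts", regexp_extract(col("_raw"), head, 1))
  .withColumn("component", regexp_extract(col("_raw"), head, 2))
  .withColumn("level", regexp_extract(col("_raw"), head, 3))
  .withColumn("msg_id", regexp_extract(col("_raw"), head, 4))

parsed.select("ts", "component", "level", "msg_id").show(false)
```

Capture groups are used here rather than a plain split on "] [" because the text after the last bracketed field is free-form, and a row that does not fit the pattern then yields empty strings instead of shifted columns.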

scala apache-spark apache-spark-sql

Score: -3 | Answers: 1 | Views: 3977
