Spark write to Postgres is slow

Asked by Ade*_*nde · 4 · Tags: dataframe, apache-spark, apache-spark-sql

I am writing data (~83M records) from a DataFrame to PostgreSQL, and it is rather slow: the write to the DB takes 2.7 hours to complete.

Looking at the executors, only one executor is running, with a single active task. Is there any way to parallelize the write to the DB using all the executors in Spark?

...
val prop = new Properties()
prop.setProperty("user", DB_USER)
prop.setProperty("password", DB_PASSWORD)
prop.setProperty("driver", "org.postgresql.Driver")



salesReportsDf.write
              .mode(SaveMode.Append)
              .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", REPORTS_TABLE, prop)

Thanks

Answered by Ade*_*nde · 5

So I figured out the problem. Basically, repartitioning my DataFrame improved the database write throughput by 100%.
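In code, the fix amounts to repartitioning before the write so each partition gets its own JDBC connection and insert stream. This is a sketch based on the question's snippet; the partition count of 16 and the `batchsize` value are illustrative assumptions — tune them to your executor count and to how many concurrent connections your Postgres instance can handle:

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

// Hypothetical tuning value: roughly the number of parallel JDBC
// connections you want open against Postgres at once.
val NUM_WRITE_PARTITIONS = 16

salesReportsDf
  .repartition(NUM_WRITE_PARTITIONS)  // spread rows across tasks/executors
  .write
  .mode(SaveMode.Append)
  .option("batchsize", 10000)         // rows per JDBC batch insert (Spark default is 1000)
  .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", REPORTS_TABLE, prop)
```

With a single partition, the whole 83M-row write runs as one task on one executor, which matches the behavior described in the question.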

def srcTable(config: Config): Map[String, String] = {

  val SERVER             = config.getString("db_host")
  val PORT               = config.getInt("db_port")
  val DATABASE           = config.getString("database")
  val USER               = config.getString("db_user")
  val PASSWORD           = config.getString("db_password")
  val TABLE              = config.getString("table")
  val PARTITION_COL      = config.getString("partition_column")
  val LOWER_BOUND        = config.getString("lowerBound")
  val UPPER_BOUND        = config.getString("upperBound")
  val NUM_PARTITION      = config.getString("numPartitions")

  Map(
    "url"     -> s"jdbc:postgresql://$SERVER:$PORT/$DATABASE",
    "driver"  -> "org.postgresql.Driver",
    "dbtable" -> TABLE,
    "user"    -> USER,
    "password"-> PASSWORD,
    "partitionColumn" -> PARTITION_COL,
    "lowerBound" -> LOWER_BOUND,
    "upperBound" -> UPPER_BOUND,
    "numPartitions" -> NUM_PARTITION
  )

}
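Note that `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` are options for the JDBC *read* path: Spark issues `numPartitions` parallel queries, each covering a slice of the partition column's range, so the resulting DataFrame is already split across executors. A sketch of how the map above would be consumed (the `SparkSession` setup is assumed, not part of the original answer):

```scala
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

val spark  = SparkSession.builder().appName("reports").getOrCreate()
val config = ConfigFactory.load()

// srcTable(config) supplies url/driver/dbtable/credentials plus the
// partitioning options, so the load is parallelized from the start.
val salesReportsDf = spark.read
  .format("jdbc")
  .options(srcTable(config))
  .load()
```

Because the DataFrame then has `numPartitions` partitions, the subsequent `.write.jdbc(...)` also runs as that many parallel tasks.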