Spark 2.2.0 - How to write/read a DataFrame to/from DynamoDB

Béa*_*nac 8 scala amazon-emr amazon-dynamodb apache-spark

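I want my Spark application to read a table from DynamoDB, do some processing, and then write the result back to DynamoDB.

Reading the table into a DataFrame

Right now I can read the table from DynamoDB into Spark as a hadoopRDD and convert it to a DataFrame. However, I had to use a regular expression to extract the value from the AttributeValue. Is there a better, more elegant way? I couldn't find anything in the AWS API.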

package main.scala.util

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.rdd.RDD
import scala.util.matching.Regex
import java.util.HashMap

import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.io.Text
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
/* Importing DynamoDBInputFormat and DynamoDBOutputFormat */
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable

object Tester {

  // {S: 298905396168806365,}
  def extractValue: (String => String) = (aws: String) => {
    val pat_value = "\\s(.*),".r

    val matcher = pat_value.findFirstMatchIn(aws)
    matcher match {
      case Some(number) => number.group(1).toString
      case None => ""
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val sparkContext = spark.sparkContext

    import spark.implicits._

    // UDF to extract Value from AttributeValue
    val col_extractValue = udf(extractValue)

    // Configure connection to DynamoDB
    val jobConf_add = new JobConf(sparkContext.hadoopConfiguration)
    jobConf_add.set("dynamodb.input.tableName", "MyTable")
    jobConf_add.set("dynamodb.output.tableName", "MyTable")
    jobConf_add.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
    jobConf_add.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")

    // org.apache.spark.rdd.RDD[(org.apache.hadoop.io.Text, org.apache.hadoop.dynamodb.DynamoDBItemWritable)]
    val hadooprdd_add = sparkContext.hadoopRDD(jobConf_add, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])

    // Convert HadoopRDD to RDD
    val rdd_add: RDD[(String, String)] = hadooprdd_add.map {
      case (text, dbwritable) => (dbwritable.getItem().get("PIN").toString(), dbwritable.getItem().get("Address").toString())
    }

    // Convert RDD to DataFrame and extract Values from AttributeValue
    val df_add = rdd_add.toDF()
      .withColumn("PIN", col_extractValue($"_1"))
      .withColumn("Address", col_extractValue($"_2"))
      .select("PIN", "Address")
  }
}
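
One possible way to avoid the regex is to call the typed getters on AttributeValue (getS, getN and getBOOL in the AWS SDK for Java v1) instead of parsing its toString() output. The following is only a sketch, assuming PIN and Address are stored as scalar attributes; the helper names attributeToString and toPinAddress are hypothetical and not part of any library.

```scala
import java.util.{Map => JMap}
import com.amazonaws.services.dynamodbv2.model.AttributeValue

// Hypothetical helper: return the scalar payload of an AttributeValue as a String.
// Only S, N and BOOL are handled; any other type (sets, lists, maps, binary)
// and a missing attribute both yield null.
def attributeToString(av: AttributeValue): String =
  if (av == null) null
  else if (av.getS != null) av.getS
  else if (av.getN != null) av.getN
  else if (av.getBOOL != null) av.getBOOL.toString
  else null

// Hypothetical helper: pull the two columns used in the question out of one item.
def toPinAddress(item: JMap[String, AttributeValue]): (String, String) =
  (attributeToString(item.get("PIN")), attributeToString(item.get("Address")))
```

Inside the hadoopRDD map, the tuple could then be built with toPinAddress(dbwritable.getItem()), and rdd_add.toDF("PIN", "Address") would name the columns directly, so neither the UDF nor the regex should be needed.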

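Writing the DataFrame to DynamoDB

Many answers on Stack Overflow and elsewhere only point to the blog post or to the emr-dynamodb-hadoop GitHub repository. None of these resources actually demonstrates how to write to DynamoDB.

I tried converting my DataFrame to an RDD[Row], without success.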

df_add.rdd.saveAsHadoopDataset(jobConf_add)

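What are the steps to write this DataFrame to DynamoDB? (Bonus points if you can tell me how to control overwrite vs. putItem ;)

Note: df_add has the same schema as MyTable in DynamoDB.

Edit: I followed the recommendation of this answer, which points to this post on Using Spark SQL for ETL: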

// Format table to DynamoDB format
  val output_rdd =  df_add.as[(String,String)].rdd.map(a => {
    var ddbMap = new HashMap[String, AttributeValue]()

    // Field PIN
    var PINValue = new AttributeValue() // New AttributeValue
    PINValue.setS(a._1)                 // Set value of Attribute as String. First element of tuple
    ddbMap.put("PIN", PINValue)         // Add to HashMap

    // Field Address
    var AddValue = new AttributeValue() // New AttributeValue
    AddValue.setS(a._2)                 // Set value of Attribute as String
    ddbMap.put("Address", AddValue)     // Add to HashMap

    var item = new DynamoDBItemWritable()
    item.setItem(ddbMap)

    (new Text(""), item)
  })             

  output_rdd.saveAsHadoopDataset(jobConf_add) 

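However, I am now getting java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.hadoop.io.Text, despite following the documentation... Do you have any suggestions?

Edit 2: I read this post on Using Spark SQL for ETL more carefully:

After you have the DataFrame, perform a transformation to have an RDD that matches the types that the DynamoDB custom output format knows how to write. The custom output format expects a tuple containing the Text and DynamoDBItemWritable types.

With that in mind, the code below is exactly what the AWS blog post suggests, except that I convert output_df to an RDD, because otherwise saveAsHadoopDataset doesn't work. Now I am getting Exception in thread "main" scala.reflect.internal.Symbols$CyclicReference: illegal cyclic reference involving object InterfaceAudience. I'm at the end of my rope!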

      // Format table to DynamoDB format
  val output_df =  df_add.map(a => {
    var ddbMap = new HashMap[String, AttributeValue]()

    // Field PIN
    var PINValue = new AttributeValue() // New AttributeValue
    PINValue.setS(a.get(0).toString())                 // Set value of Attribute as String
    ddbMap.put("PIN", PINValue)         // Add to HashMap

    // Field Address
    var AddValue = new AttributeValue() // New AttributeValue
    AddValue.setS(a.get(1).toString())                 // Set value of Attribute as String
    ddbMap.put("Address", AddValue)     // Add to HashMap

    var item = new DynamoDBItemWritable()
    item.setItem(ddbMap)

    (new Text(""), item)
  })             

  output_df.rdd.saveAsHadoopDataset(jobConf_add)   

Ave*_*ell 6

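I followed the "Using Spark SQL for ETL" link and ran into the same "illegal cyclic reference" exception. The solution to that exception turns out to be quite simple (although it took me 2 days to figure out), as shown below. The key is to use the map function on the DataFrame's RDD rather than on the DataFrame itself.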

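import java.util.HashMap

import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf

// Configure the DynamoDB connector
val ddbConf = new JobConf(spark.sparkContext.hadoopConfiguration)
ddbConf.set("dynamodb.output.tableName", "<myTableName>")
ddbConf.set("dynamodb.throughput.write.percent", "1.5")
ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")

val df_ddb = spark.read.option("header","true").parquet("<myInputFile>")
val schema_ddb = df_ddb.dtypes // Array of (columnName, typeName) pairs

// Map over the DataFrame's RDD (df_ddb.rdd), not over the DataFrame itself
val ddbInsertFormattedRDD = df_ddb.rdd.map(a => {
    val ddbMap = new HashMap[String, AttributeValue]()

    // Store every non-null column as a DynamoDB string (S) attribute
    for (i <- 0 until schema_ddb.length) {
        val value = a.get(i)
        if (value != null) {
            val att = new AttributeValue()
            att.setS(value.toString)
            ddbMap.put(schema_ddb(i)._1, att)
        }
    }

    val item = new DynamoDBItemWritable()
    item.setItem(ddbMap)

    // The output format expects (Text, DynamoDBItemWritable) tuples
    (new Text(""), item)
})

ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf)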
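
The loop above writes every column as a string (S) attribute. If the target table expects number or boolean attributes (for instance a numeric key), one possible variation is to pick the attribute type from the runtime type of each value. This is a sketch under that assumption; toAttributeValue is a hypothetical helper name, and only the AttributeValue builders withS, withN and withBOOL from the AWS SDK for Java v1 are used.

```scala
import com.amazonaws.services.dynamodbv2.model.AttributeValue

// Hypothetical helper: choose the DynamoDB attribute type from the value's runtime type.
// Numbers become N, booleans become BOOL, everything else falls back to S.
// The caller is expected to skip null values, as the loop above already does.
def toAttributeValue(value: Any): AttributeValue = value match {
  case n: java.lang.Number  => new AttributeValue().withN(n.toString)
  case b: java.lang.Boolean => new AttributeValue().withBOOL(b)
  case other                => new AttributeValue().withS(other.toString)
}
```

In the loop, ddbMap.put(schema_ddb(i)._1, toAttributeValue(value)) would replace the three lines that build att. Floating-point values such as NaN or Infinity have no DynamoDB number representation, so they would need separate handling.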


Ana*_*dor 5

We created a custom DynamoDB data source for Spark:

https://github.com/audienceproject/spark-dynamodb

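It has a number of nice features:

  • Distributed, parallel scan with lazy evaluation
  • Throughput control by rate limiting on a target fraction of the provisioned table/index capacity
  • Schema discovery to suit your needs:
      • Dynamic inference
      • Static analysis of a case class
  • Column and filter pushdown
  • Global secondary index support
  • Write support

I think this definitely fits your use case. We would be very happy if you could check it out and give us feedback.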
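
For the use case in the question, usage might look roughly like the sketch below. It follows the usage shown in the project's README as far as I recall it; the implicits import and the dynamodb reader/writer methods should be checked against the version you install. The table names are placeholders, and AWS credentials and region are assumed to be configured in the environment.

```scala
import org.apache.spark.sql.SparkSession
import com.audienceproject.spark.dynamodb.implicits._

val spark = SparkSession.builder().getOrCreate()

// Read the table into a DataFrame; the schema is inferred from the table's items.
val df_add = spark.read.dynamodb("MyTable")

// ... transform df_add as needed ...

// Write the result to a table ("MyOutputTable" is a placeholder name).
df_add.write.dynamodb("MyOutputTable")
```

With this approach there is no need to handle AttributeValue, DynamoDBItemWritable or JobConf directly.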