I'm using CDH 5.2 and can run commands through spark-shell. Thanks in advance.
I'm creating an avro RDD with the following code:
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.mapred.AvroKey

def convert2Avro(data: String, schema: Schema): AvroKey[GenericRecord] = {
  // Build a GenericRecord for the given schema and wrap it in an AvroKey
  val record = new GenericData.Record(schema)
  record.put("empname", "John")
  val wrapper = new AvroKey[GenericRecord]()
  wrapper.datum(record)
  wrapper
}
and create the avro RDD as follows:
val avroRDD = fieldsRDD.map(x => convert2Avro(x, schema))
When it executes, I get the following exception on the line above:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.map(RDD.scala:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
Any pointers?
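The root cause appears to be that org.apache.avro.Schema (here Schema$RecordSchema) is not java.io.Serializable, so it cannot be captured in the map closure. One commonly suggested workaround (a minimal sketch, not verified on CDH 5.2) is to ship the schema as its JSON string, which is serializable, and re-parse it on the executor side:

import org.apache.avro.Schema

// A plain String is serializable; parse it back into a Schema inside the task.
val schemaString = schema.toString  // JSON form of the Avro schema

val avroRDD = fieldsRDD.mapPartitions { iter =>
  // Parse once per partition instead of once per record
  val parsedSchema = new Schema.Parser().parse(schemaString)
  iter.map(x => convert2Avro(x, parsedSchema))
}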
I can print the data in both RDDs with the following code:
usersRDD.foreach(println)
empRDD.foreach(println)
I need to compare the data in the two RDDs. How do I iterate over one RDD and compare its field data with the field data in the other? For example: iterate over the records and check whether the name and age in userRDD have a matching record in empRDD; if not, put those records into a separate RDD.
I've tried userRDD.subtract(empRDD), but that compares all of the fields.
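To illustrate the kind of comparison I mean, here is a minimal sketch that keys each record on only name and age before subtracting, assuming (purely for illustration) that both RDDs hold comma-separated strings with name and age in the first two positions:

// Key userRDD records by (name, age), keeping the full record as the value
val userKeyed = usersRDD.map { line =>
  val f = line.split(",")
  ((f(0), f(1)), line)
}

// Key empRDD records the same way
val empKeyed = empRDD.map { line =>
  val f = line.split(",")
  ((f(0), f(1)), line)
}

// userRDD records whose (name, age) key has no match in empRDD
val unmatchedRDD = userKeyed.subtractByKey(empKeyed).values

A leftOuterJoin followed by filtering out the matched keys would be an alternative if the matching records themselves are also needed.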