Cry*_*ark 10 hadoop amazon-s3 apache-spark
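This is a question I already asked on the Spark user mailing list, and I hope to have more success here.
I'm not sure it is directly related to Spark, although Spark has something to do with the fact that I can't easily solve this problem.
I'm trying to get some files from S3 using various patterns. My problem is that some of those patterns may match nothing, and when that happens I get the following exception: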
org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3n://bucket/mypattern matches 0 files
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:52)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:52)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:52)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
at org.apache.spark.api.java.JavaPairRDD.reduceByKey(JavaPairRDD.scala:335)
... 2 more
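I would like a way to ignore missing files and simply do nothing in that case. The problem is that I don't know whether a pattern will match anything until it is actually evaluated, and Spark only starts processing data when an action occurs (here, the reduceByKey part). So I can't just catch the error somewhere and let things continue.
One solution would be to force Spark to process each path individually, but that would probably cost a lot in speed and/or memory, so I'm looking for another, more efficient option.
I'm using Spark 0.9.1. Thanks.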
OK, after digging a bit into Spark, and thanks to someone guiding me on the Spark user list, I think I've got it:
sc.newAPIHadoopFile("s3n://missingPattern/*", EmptiableTextInputFormat.class, LongWritable.class, Text.class, sc.hadoopConfiguration())
    .map(new Function<Tuple2<LongWritable, Text>, String>() {
        @Override
        public String call(Tuple2<LongWritable, Text> arg0) throws Exception {
            return arg0._2.toString();
        }
    })
    .count();
The magic is in EmptiableTextInputFormat:
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.InvalidInputException;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// A TextInputFormat that yields no splits instead of failing on invalid input.
public class EmptiableTextInputFormat extends TextInputFormat {
    @Override
    public List<InputSplit> getSplits(JobContext arg0) throws IOException {
        try {
            return super.getSplits(arg0);
        } catch (InvalidInputException e) {
            // Treat the input as empty (this swallows every InvalidInputException).
            return Collections.<InputSplit> emptyList();
        }
    }
}
Run Code Online (Sandbox Code Playgroud)
One could also inspect the message of the InvalidInputException to be more precise about which failures are ignored.
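A minimal sketch of that message check follows. It works on the message string only, so it needs no Hadoop classes to run. The class name `EmptyInputMessages`, the method `isOnlyEmptyMatch`, and the constant `NO_MATCH_FRAGMENT` are hypothetical names introduced for illustration. The fragment "matches 0 files" is copied from the stack trace in the question; that other Hadoop versions use the same wording, and that multiple problems are reported one per line, are assumptions.

```java
// Sketch only: decides from an exception message whether the failure is purely
// "the glob matched nothing". The wording is taken from the stack trace in the
// question and may differ in other Hadoop versions (assumption).
final class EmptyInputMessages {
    // Hypothetical constant: fragment seen in "Input Pattern ... matches 0 files".
    static final String NO_MATCH_FRAGMENT = "matches 0 files";

    private EmptyInputMessages() {}

    // Returns true only if every non-blank line of the message reports a
    // pattern that matched no files. Assumes one problem per line.
    static boolean isOnlyEmptyMatch(String message) {
        if (message == null) {
            return false;
        }
        boolean sawLine = false;
        for (String line : message.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            if (!line.contains(NO_MATCH_FRAGMENT)) {
                return false; // some other problem, e.g. a missing path
            }
            sawLine = true;
        }
        return sawLine;
    }
}
```

Inside the catch block of getSplits, one could then rethrow anything that is not an empty match, for example `if (!EmptyInputMessages.isOnlyEmptyMatch(e.getMessage())) throw e;`, before returning the empty list.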