相关疑难解决方法(0)

在emr-4.0.0上,spark-1.4.1 saveAsTextFile到S3非常慢

我在amazom aws emr 4.0.0中运行spark 1.4.1

对于一些共振火花saveAsTextFile在emr 4.0.0上与emr 3.8相比非常慢(为5秒,现在为95秒)

实际上saveAsTextFile表示它已经在4.356秒完成,但之后我看到很多INFO消息,com.amazonaws.latency记录器在接下来的90秒内出现404错误

spark> sc.parallelize(List.range(0, 1600000),160).map(x => x + "\t" + "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")

2015-09-01 21:16:17,637 INFO  [dag-scheduler-event-loop] scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5 (saveAsTextFile at <console>:22) finished in 4.356 s
2015-09-01 21:16:17,637 INFO  [task-result-getter-2] cluster.YarnScheduler (Logging.scala:logInfo(59)) - Removed TaskSet 5.0, whose tasks have all completed, from pool 
2015-09-01 21:16:17,637 INFO  [main] scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at <console>:22, took 4.547829 s
2015-09-01 21:16:17,638 INFO  [main] s3n.S3NativeFileSystem (S3NativeFileSystem.java:listStatus(896)) - listStatus s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
2015-09-01 …
Run Code Online (Sandbox Code Playgroud)

amazon-s3 emr apache-spark

7
推荐指数
1
解决办法
1573
查看次数

写入 hdfs 路径时出现错误 java.io.IOException: Failed to rename

我使用的是使用 hadoop-2.6.5.jar 版本的 spark-sql-2.4.1v。我需要先将数据保存在 hdfs 上,然后再转移到 cassandra。因此,我试图将数据保存在 hdfs 上,如下所示:

String hdfsPath = "/user/order_items/";
cleanedDs.createTempViewOrTable("source_tab");

givenItemList.parallelStream().forEach( item -> {   
    String query = "select $item  as itemCol , avg($item) as mean groupBy year";
    Dataset<Row> resultDs = sparkSession.sql(query);

    saveDsToHdfs(hdfsPath, resultDs );   
});


public static void saveDsToHdfs(String parquet_file, Dataset<Row> df) {
    df.write()                                 
      .format("parquet")
      .mode("append")
      .save(parquet_file);
    logger.info(" Saved parquet file :   " + parquet_file + "successfully");
}
Run Code Online (Sandbox Code Playgroud)

当我在集群上运行我的工作时,它无法抛出此错误:

java.io.IOException: Failed to rename FileStatus{path=hdfs:/user/order_items/_temporary/0/_temporary/attempt_20180626192453_0003_m_000007_59/part-00007.parquet; isDirectory=false; length=952309; replication=1; blocksize=67108864; modification_time=1530041098000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to hdfs:/user/order_items/part-00007.parquet …
Run Code Online (Sandbox Code Playgroud)

hadoop hdfs apache-spark hadoop2 apache-spark-sql

0
推荐指数
1
解决办法
1744
查看次数

标签 统计

apache-spark ×2

amazon-s3 ×1

apache-spark-sql ×1

emr ×1

hadoop ×1

hadoop2 ×1

hdfs ×1