I have a Spark cluster on DC/OS, and I am running a Spark job that reads from S3. The versions are as follows:
I read the data in like this:
// Point the s3a connector at the endpoint and credentials from the application config
val hadoopConf = sparkSession.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.endpoint", Config.awsEndpoint)
hadoopConf.set("fs.s3a.access.key", Config.awsAccessKey)
hadoopConf.set("fs.s3a.secret.key", Config.awsSecretKey)
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
// Read the Parquet data through the s3a:// scheme
val data = sparkSession.read.parquet("s3a://" + "path/to/file")
The error I get is:
Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:215)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:138)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:170)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:44)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:321)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
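at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:559)
…

I set up the Spark dependency in my first project. However, when I try to create a new project with Spark, sbt does not import the external jars for org.apache.spark, so IntelliJ IDEA reports "Cannot resolve symbol". I have already tried creating a new project from scratch and using the auto-import feature, but nothing works. When I try to compile, I get the message "object apache is not a member of package org". My build.sbt looks like this: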
name := "hello"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" % "spark-parent_2.10" % "1.4.1"
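I have the impression that something may be wrong with my sbt setup, although it worked once before, and everything is the same except for the external libraries. I also tried importing the pom.xml file of my Spark dependency, but that does not work either. Thank you in advance!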
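A possible fix, offered as a sketch and not as a confirmed answer: `spark-parent` is the parent POM of the Spark build (its packaging is `pom`), so it carries no classes, and its `_2.10` suffix does not match `scalaVersion := "2.11.7"`. Assuming the project only needs Spark's core API, depending on `spark-core` with `%%` lets sbt append the Scala binary suffix itself:

```scala
// build.sbt (sketch; assumes only the core Spark API is needed)
name := "hello"

version := "1.0"

scalaVersion := "2.11.7"

// %% appends the Scala binary version, so this resolves to spark-core_2.11
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"
```

If DataFrames are used as well, a `spark-sql` dependency can be declared the same way. After changing build.sbt, the sbt project has to be refreshed in IntelliJ before the symbols resolve.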
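
I read a CSV file in which one column contains strings that should be converted to datetimes. The strings have the form MM/dd/yyyy HH:mm. But when I try to convert them with joda-time, I always get this error:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type org.joda.time.DateTime is not supported
I don't know what exactly the problem is...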
val input = c.textFile("C:\\Users\\AAPL.csv").map(_.split(",")).map{p =>
val formatter: DateTimeFormatter = DateTimeFormat.forPattern("MM/dd/yyyy HH:mm");
val date: DateTime = formatter.parseDateTime(p(0));
StockData(date, p(1).toDouble, p(2).toDouble, p(3).toDouble, p(4).toDouble, p(5).toInt, p(6).toInt)
}.toDF()
Can anyone help?
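One common explanation, offered here as a sketch and not as a confirmed answer: `toDF()` derives the DataFrame schema from the fields of the case class, and Spark SQL has no mapping for `org.joda.time.DateTime`, whereas `java.sql.Timestamp` is a supported type. The example below assumes the case class can store a `Timestamp` instead. The field names of `StockData` and the helpers `parseTimestamp` and `toStockData` are hypothetical, and `java.text.SimpleDateFormat` is swapped in for joda-time only to keep the example free of dependencies.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat

object StockParsing {
  // Hypothetical field names; the question does not show the real case class.
  case class StockData(date: Timestamp, open: Double, high: Double, low: Double,
                       close: Double, volume: Int, adjClose: Int)

  // Parses "MM/dd/yyyy HH:mm" in the JVM default time zone.
  // A new SimpleDateFormat is created per call because the class is not thread-safe.
  def parseTimestamp(s: String): Timestamp = {
    val format = new SimpleDateFormat("MM/dd/yyyy HH:mm")
    format.setLenient(false)
    new Timestamp(format.parse(s).getTime)
  }

  // Converts one already-split CSV row into a StockData.
  def toStockData(p: Array[String]): StockData =
    StockData(parseTimestamp(p(0)), p(1).toDouble, p(2).toDouble, p(3).toDouble,
      p(4).toDouble, p(5).toInt, p(6).toInt)
}
```

In the pipeline from the question, this would replace the body of the second `map`, for example `.map(StockParsing.toStockData).toDF()`. If the joda-time parsing is kept, `new Timestamp(date.getMillis)` performs the same conversion.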
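
I am having trouble installing Spark 1.4.1 with Scala 2.11.7 in IntelliJ Idea 14.1.4. First of all: I installed the source code version. Should I have installed the version for Hadoop 2.4+ instead? What I did: I created a Maven project from the tgz file and saved it. Do I need to do anything more? The first lines of the pom.xml file are: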
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.apache</groupId>
<artifactId>apache</artifactId>
<version>14</version>
</parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.10</artifactId>
<version>1.4.1</version>
<packaging>pom</packaging>
<name>Spark Project Parent POM</name>
<url>http://spark.apache.org/</url>
<licenses>
<license>
<name>Apache 2.0 License</name>
<url>http://www.apache.org/licenses/LICENSE-2.0.html</url>
<distribution>repo</distribution>
</license>
</licenses>
<scm>
<connection>scm:git:git@github.com:apache/spark.git</connection>
<developerConnection>scm:git:https://git-wip-us.apache.org/repos/asf/spark.git</developerConnection>
<url>scm:git:git@github.com:apache/spark.git</url>
<tag>HEAD</tag>
</scm>
<developers>
<developer>
<id>matei</id>
<name>Matei Zaharia</name>
<email>matei.zaharia@gmail.com</email>
<url>http://www.cs.berkeley.edu/~matei</url>
<organization>Apache Software Foundation</organization>
<organizationUrl>http://spark.apache.org</organizationUrl>
</developer>
</developers>
I tried to run Spark with a simple example, using this build.sbt:
name := "hello"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" % "spark-parent_2.10" % "1.4.1"
But I get the error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/27 11:14:03 …

I am trying to get Spark 1.4.1 to work with Scala 2.11.7 in IntelliJ Idea 14.1, but I keep getting this error:
ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Does anyone know which binaries I need to download?
My pom.xml from Spark is this (the beginning):
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.apache</groupId>
<artifactId>apache</artifactId>
<version>14</version>
</parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.10</artifactId>
<version>1.4.1</version>
<packaging>pom</packaging>
<name>Spark Project Parent POM</name>
<url>http://spark.apache.org/</url>
<licenses>
<license>
<name>Apache 2.0 License</name>
<url>http://www.apache.org/licenses/LICENSE-2.0.html</url>
<distribution>repo</distribution>
</license>
</licenses>
<scm>
<connection>scm:git:git@github.com:apache/spark.git</connection>
<developerConnection>scm:git:https://git-wip-us.apache.org/repos/asf/spark.git</developerConnection>
<url>scm:git:git@github.com:apache/spark.git</url>
<tag>HEAD</tag>
</scm>
<developers>
<developer>
<id>matei</id>
<name>Matei Zaharia</name>
…
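
For the `winutils.exe` error in this question, a commonly reported workaround on Windows is sketched here, as an assumption and not as a verified fix for this setup. Hadoop's `Shell` class looks for `bin\winutils.exe` under the directory named by the `hadoop.home.dir` system property (or the `HADOOP_HOME` environment variable), and the `null` in the message suggests that neither is set. The directory `C:\hadoop` and the object name `WinutilsSetup` are hypothetical; `winutils.exe` would have to be placed in `C:\hadoop\bin` beforehand.

```scala
object WinutilsSetup {
  // Hypothetical location; winutils.exe is expected at C:\hadoop\bin\winutils.exe.
  val hadoopHome: String = "C:\\hadoop"

  // Call this before the first SparkContext is created, because Hadoop reads
  // the property when its Shell class is first loaded.
  def configure(): Unit = {
    if (System.getProperty("hadoop.home.dir") == null) {
      System.setProperty("hadoop.home.dir", hadoopHome)
    }
  }
}
```

Calling `WinutilsSetup.configure()` as the first statement of `main` leaves an already configured property untouched and only fills in the hypothetical default when nothing is set.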