471*_*711 5 r sparkr apache-spark-1.5
我是Spark的新手,想知道下面有哪些选项可以使用SparkR从RStudio读取存储在hdfs中的数据,或者我是否正确使用它们.数据可以是任何类型(纯文本,csv,json,xml或任何包含关系表的数据库)和任何大小(1kb - 几gb).
我知道不应该再使用textFile(sc,path)了,除了read.df函数之外还有其他可能读取这类数据吗?
以下代码使用read.df和jsonFile但jsonFile会产生错误:
Sys.setenv(SPARK_HOME = "C:\\Users\\--\\Downloads\\spark-1.5.0-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
#load the Sparkr library
library(SparkR)
# Create a spark context and a SQL context
sc <- sparkR.init(master="local", sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)
#create a sparkR DataFrame
df <- read.df(sqlContext, "hdfs://0.0.0.0:19000/people.json", source = "json")
df <- jsonFile(sqlContext, "hdfs://0.0.0.0:19000/people.json")
Run Code Online (Sandbox Code Playgroud)
read.df适用于json,但是如何读取仅由新行分隔的日志消息等文本?例如
> df <- read.df(sqlContext, "hdfs://0.0.0.0:19000/README.txt", "text")
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.ClassNotFoundException: Failed to load class for data source: text.
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:67)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:87)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:156)
at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:132)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:79)
at org.apache.spark.ap
Run Code Online (Sandbox Code Playgroud)
jsonFile的错误是:
> df <- jsonFile(sqlContext, "hdfs://0.0.0.0:19000/people.json")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.io.IOException: No input paths specified in job
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfu
Run Code Online (Sandbox Code Playgroud)
我不知道为什么read.df会抛出错误,因为我没有重新使用SparkR或调用SparkR.stop()
对于除了使用read.df相同的代码我使用SparkR :::文本文件功能和SC,而不是sqlContext(以下对过时的介绍amplab).
错误消息是:
data <- SparkR:::textFile(sc, "hdfs://0.0.0.0:19000/people.json")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 5: hdfs:
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.hadoop.fs.Globber.glob(Globber.java:211)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1644)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at or
Run Code Online (Sandbox Code Playgroud)
这个错误看起来路径不正确,但我不知道为什么.
我目前使用的是什么:
spark-1.5.0-bin-hadoop2.6 hadoop-2.6.0 Windows(8.1)R版本3.2.2 Rstudio版本0.99.484
我希望有人可以在这里给我一些关于这个问题的提示.
小智 1
尝试
% hadoop fs -put people.json /
% sparkR
> people <- read.df(sqlContext, "/people.json", "json")
> head(people)
Run Code Online (Sandbox Code Playgroud)