I came across some code that starts Spark locally:
val conf = new SparkConf().setAppName("test").setMaster("local[*]")
val ctx = new SparkContext(conf)
What does [*] mean?
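For reference, the bracketed part of a local master URL controls how many worker threads Spark runs with. A quick sketch of the forms the Spark docs list (the SparkConf values here are just illustrations):

import org.apache.spark.SparkConf

new SparkConf().setMaster("local")     // one worker thread, no parallelism
new SparkConf().setMaster("local[4]")  // four worker threads
new SparkConf().setMaster("local[*]")  // as many worker threads as logical cores on the machine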
I have used this code, and the output I get is:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/02/03 20:39:24 INFO SparkContext: Running Spark version 2.1.0
17/02/03 20:39:25 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
17/02/03 20:39:25 WARN SparkConf: Detected deprecated memory fraction
settings: [spark.storage.memoryFraction]. As of Spark 1.6, execution and
storage memory management are unified. All memory fractions used in the old
model are now deprecated and no longer read. If you wish to use the old
memory management, you …
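For reference, that WARN fires because the legacy setting is still present in the conf; as the message itself says, the value is no longer read under the unified memory manager. A hedged sketch of the line that would trigger it, plus the opt-out switch the message refers to:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("test")
  .setMaster("local[*]")
  // On Spark >= 1.6 this value is no longer read; it only triggers the WARN above.
  .set("spark.storage.memoryFraction", "0.5")
  // To actually keep the old model, Spark documents this switch instead:
  // .set("spark.memory.useLegacyMode", "true")

The toy example below shows plainly enough how to program in Spark: you just import, create, use, and throw everything away.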
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.hive.HiveContext
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("example")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._
import hiveContext.sql
// load data from hdfs
val df1 = sc.textFile("hdfs://.../myfile.csv").map(...)
val df1B = sc.broadcast(df1.collect()) // broadcast driver-side data, not the RDD itself
// load data from hive
val df2 = sql("select * from mytable")
// transform df2 with df1B
val cleanCol = udf(cleanMyCol(df1B))
val df2_new = df2.withColumn("myCol", cleanCol(col("myCol")))
...
sc.stop()
}
In the real world I find myself writing lots of functions to modularize tasks. For example, I have several functions just for loading different data tables, and inside those load functions I call further functions to do the necessary cleaning/transformation as the data comes in. I then pass the context around like this:
def loadHdfsFileAndBroadcast(sc: …
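A minimal sketch of what such a helper might look like; the signature above is truncated after `sc`, so the extra parameter, return type, and body here are assumptions rather than the author's actual code:

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical sketch: load a text file and broadcast its lines to the executors.
def loadHdfsFileAndBroadcast(sc: SparkContext, path: String): Broadcast[Array[String]] = {
  val lines = sc.textFile(path).collect() // pull to the driver before broadcasting
  sc.broadcast(lines)                     // ship one read-only copy per executor
}

I have: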
val sparkBuilder: SparkSession.Builder = SparkSession
.builder
.appName("CreateModelDataPreparation")
.config("spark.master", "local")
implicit val spark: SparkSession = sparkBuilder.getOrCreate()
But when I run my program, I still get:
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
The SparkSession is set up inside the main method, as suggested in other posts. That does not seem to fix the problem.
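For comparison, here is a self-contained minimal program (assuming Spark 2.x on the classpath; the object name is made up) that sets the master through the builder and does start locally. If this runs but the real program still throws, the master is being lost somewhere between the builder shown above and the SparkContext construction in the stack trace:

import org.apache.spark.sql.SparkSession

object MasterUrlCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("CreateModelDataPreparation")
      .master("local[*]") // set on the builder before getOrCreate()
      .getOrCreate()
    println(s"master = ${spark.sparkContext.master}") // should print local[*]
    spark.stop()
  }
}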
This is not a duplicate of the suggested question, because I have already tried both of the following:
def main(argv: Array[String]): Unit = {
import DeweyConfigs.implicits.da3wConfig
val commandlineArgs: DeweyReaderArgs = processCommandLineArgs(argv)
val sparkBuilder: SparkSession.Builder = SparkSession
.builder
.appName("CreateModelDataPreparation")
.master("local")
implicit val spark: SparkSession = sparkBuilder.config("spark.master", "local").getOrCreate()
import spark.implicits._
...
and
def main(argv: …