I launch pyspark in client mode: bin/pyspark --master yarn-client --num-executors 60. import numpy works fine in the shell, but it fails inside KMeans. My feeling is that numpy is somehow not installed on the executors. I haven't found a good way to make the workers aware of numpy. I tried setting PYSPARK_PYTHON (see the sketch below), but that didn't work either.
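One way to point the workers at a numpy-capable interpreter (a sketch, not a verified fix: the interpreter path is a placeholder that must exist on every node, and spark.executorEnv.* is Spark's generic mechanism for setting executor environment variables):

# PYSPARK_PYTHON covers the driver; spark.executorEnv propagates it to the workers
export PYSPARK_PYTHON=/usr/local/bin/python2.7
bin/pyspark --master yarn-client --num-executors 60 \
  --conf spark.executorEnv.PYSPARK_PYTHON=/usr/local/bin/python2.7

The code: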
import numpy
features = numpy.load(open("combined_features.npz"))
features = features['arr_0']
features.shape
features_rdd = sc.parallelize(features, 5000)
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
clusters = KMeans.train(features_rdd, 2, maxIterations=10, runs=10, initializationMode="random")
Stack trace:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
ImportError: …

I'm using the code example at http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase to read an HBase table with Spark. The only change is that I set hbase.zookeeper.quorum through code, because it was not being picked up from hbase-site.xml.
Spark 1.5.3, HBase 0.98.0
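The quorum change amounts to something like this (a sketch using the standard HBase client API; the ZooKeeper hosts and table name are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// set the quorum in code, since hbase-site.xml is not being picked up
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com")
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")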
I'm running into this error:
java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
at org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:921)
at org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:132)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1520)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1294)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1111)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:201)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:159)
at test.MyHBase.getTable(MyHBase.scala:33)
at test.MyHBase.<init>(MyHBase.scala:11)
at $line43.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.fetch(<console>:30)
at $line44.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:49)
at $line44.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:905)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:905)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at …
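For reference: HBaseZeroCopyByteString is compiled into the com.google.protobuf package inside the hbase-protocol jar, and an IllegalAccessError here usually means that class was loaded by a different classloader than the protobuf classes it needs package-private access to. A commonly suggested workaround (a sketch with placeholder paths, not verified for this setup) is to put hbase-protocol on the Spark classpath explicitly:

spark-submit \
  --conf spark.driver.extraClassPath=/path/to/hbase-protocol-0.98.0.jar \
  --conf spark.executor.extraClassPath=/path/to/hbase-protocol-0.98.0.jar \
  ...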
I'm using sbt assembly to build a fat jar that can run on Spark. One of the dependencies is grpc-netty. The Guava version on Spark is older than the one grpc-netty needs, and I run into this error: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument. I was able to get past it by setting userClassPathFirst to true on Spark, but that leads to other errors.

Correct me if I'm wrong, but as I understand it, if I shade correctly I shouldn't need to set userClassPathFirst to true. This is how I'm doing the shading right now:
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.google.guava.**" -> "my_conf.@1")
.inLibrary("com.google.guava" % "guava" % "20.0")
.inLibrary("io.grpc" % "grpc-netty" % "1.1.2")
)
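// Note (an assumption worth verifying, not part of the original build): Guava
// ships its classes under com.google.common and com.google.thirdparty, not
// com.google.guava, so the pattern above likely matches no classes at all.
// A rename along these lines may be what was intended:
//
//   assemblyShadeRules in assembly := Seq(
//     ShadeRule.rename("com.google.common.**" -> "my_conf.@1")
//       .inLibrary("com.google.guava" % "guava" % "20.0")
//       .inLibrary("io.grpc" % "grpc-netty" % "1.1.2")
//   )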
libraryDependencies ++= Seq(
"org.scalaj" %% "scalaj-http" % "2.3.0",
"org.json4s" %% "json4s-native" % "3.2.11",
"org.json4s" %% "json4s-jackson" % "3.2.11",
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
"org.apache.spark" % "spark-sql_2.11" % "2.2.0" % "provided",
"org.clapper" %% "argot" % "1.0.3",
"com.typesafe" % "config" % "1.3.1",
"com.databricks" %% "spark-csv" % "1.5.0",
"org.apache.spark" % "spark-mllib_2.11" % "2.2.0" % …Run Code Online (Sandbox Code Playgroud) 当我尝试Array使用列表推导创建一个时,Array{Any, 1}即使我将所有元素编码为"symbol" ,它也会产生:
julia> u_col_names=[symbol("user_id"), symbol("age"), symbol("sex"), symbol("occupation"), symbol("zip_code")]
5-element Array{Symbol,1}:
:user_id
:age
:sex
:occupation
:zip_code
julia> col_names=["user_id", "age", "sex", "occupation", "zip_code"]
5-element Array{ASCIIString,1}:
"user_id"
"age"
"sex"
"occupation"
"zip_code"
julia> u_col_names=[symbol(col_names[i]) for i in 1:size(col_names)[1]]
5-element Array{Any,1}:
:user_id
:age
:sex
:occupation
:zip_code
Why does the last list comprehension return Array{Any, 1} and not Array{Symbol, 1}? Note that the following does return Array{Symbol, 1}:
julia> u_col_names=[symbol("col_names$i") for i in 1:size(col_names)[1]]
5-element Array{Symbol,1}:
:col_names1
:col_names2
:col_names3
:col_names4
:col_names5
Interestingly, the following does too:
julia> col_names[1]
"user_id"
julia> symbol(col_names[1])
:user_id
julia> [symbol(col_names[1]), …
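For reference, the element type can be forced regardless of what inference does (a sketch, untested on the Julia version shown above):

julia> u_col_names = Symbol[symbol(c) for c in col_names]  # typed comprehension pins the eltype
julia> u_col_names = map(symbol, col_names)                # map should also give Array{Symbol,1}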
Is there a way to do something like this (this is R)?

df$dataCol <- as.Date(df$dataCol, format="%Y%m%d")
where dataCol has the format "20151009".
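The question doesn't name the system the equivalent is wanted in; assuming a Spark DataFrame (in keeping with the rest of this page, with df and dataCol carried over from the R snippet), a PySpark sketch:

from pyspark.sql import functions as F

# parse "20151009" with a SimpleDateFormat pattern, then truncate to a date
df = df.withColumn("dataCol",
                   F.to_date(F.unix_timestamp("dataCol", "yyyyMMdd").cast("timestamp")))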
I'm trying to compile the code below with "g++ main.cpp -c", but it gives me this strange error. Any ideas?
main.cpp: In function ‘int main()’:
main.cpp:9:17: error: invalid conversion from ‘Graph*’ to ‘int’
main.cpp:9:17: error: initializing argument 1 of ‘Graph::Graph(int)’
main.cpp:10:16: warning: deprecated conversion from string constant to ‘char*’
Here is the main module I'm trying to compile, and below it is my Graph class from graph.hpp.
#include <iostream>
#include "graph.hpp"
using namespace std;
int main()
{
Graph g;
g = new Graph();
char* path = "graph.csv";
g.createGraph(path);
return 0;
}
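For context on the two messages: new Graph() yields a Graph*, and assigning it to the Graph object g makes the compiler hunt for a conversion, landing on the Graph(int) constructor; the warning comes from binding a string literal to a non-const char*. A corrected main (a sketch, assuming createGraph keeps its char* parameter):

#include <iostream>
#include "graph.hpp"
using namespace std;

int main()
{
    Graph g;                   // a plain stack object; no 'new' (which returns Graph*)
    char path[] = "graph.csv"; // a mutable buffer avoids the string-constant warning
    g.createGraph(path);
    return 0;
}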
Here is my Graph class:
/*
* graph.hpp
*
* Created on: Jan 28, 2012
* Author: ajinkya
*/
#ifndef _GRAPH_HPP_
#define _GRAPH_HPP_
#include "street.hpp"
#include …