I am trying to run pyspark from my Mac to perform computations on an EC2 Spark cluster.
If I log in to the cluster, it works as expected:
$ ec2/spark-ec2 -i ~/.ec2/spark.pem -k spark login test-cluster2
$ spark/bin/pyspark
Then I run a simple job:
>>> data=sc.parallelize(range(1000),10)
>>> data.count()
It works as expected:
14/06/26 16:38:52 INFO spark.SparkContext: Starting job: count at <stdin>:1
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Got job 0 (count at <stdin>:1) with 10 output partitions (allowLocal=false)
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count at <stdin>:1)
...
14/06/26 16:38:53 INFO spark.SparkContext: Job finished: count at <stdin>:1, took 1.195232619 s
1000
But if I now try the same thing from my local machine,
$ MASTER=spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077 bin/pyspark
it does not seem to be able to connect to the cluster:
14/06/26 09:45:43 INFO AppClient$ClientActor: Connecting to master …

In[216]: foo = pd.DataFrame({'a':[1,2,3], 'b':[3,4,5]})
In[217]: bar = foo.ix[:1]
In[218]: bar
Out[218]:
   a  b
0  1  3
1  2  4
The view is created as expected.
In[219]: bar['a'] = 100
In[220]: bar
Out[220]:
     a  b
0  100  3
1  100  4
In[221]: foo
Out[221]:
     a  b
0  100  3
1  100  4
2    3  5
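If the view is modified, so is the original dataframe foo. However, if the assignment uses None, a copy seems to be made instead. Can anyone shed some light on what is happening, and on the logic behind it?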
In[222]: bar['a'] = None
In[223]: bar
Out[223]:
      a  b
0  None  3
1  None  4
In[224]: foo
Out[224]:
     a  b
0  100  3
1  100  4
2    3  5
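For what it's worth, here is a minimal sketch of how the ambiguity can be sidestepped. It does not explain the internals. One plausible reading, offered as an assumption only, is that assigning None forces column a to object dtype, which needs new storage that foo does not share. Whether a slice behaves as a view or a copy can depend on the pandas version and the dtypes involved, so the sketch makes the intent explicit with .copy() and .loc (.ix was removed in later pandas releases). The names foo and bar come from the session above; the helper lists at the end are only for inspection.

```python
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})

# Take an explicit copy of the first two rows (label slice 0..1, inclusive),
# so that later writes to bar cannot reach foo.
bar = foo.loc[:1].copy()
bar['a'] = 100

# To change foo itself, assign through foo with .loc instead of via a slice.
foo.loc[:1, 'b'] = 0

# Plain lists for inspection (illustrative helper names).
foo_a = foo['a'].tolist()
foo_b = foo['b'].tolist()
bar_a = bar['a'].tolist()
bar_b = bar['b'].tolist()
```

With the explicit copy, bar and foo are independent regardless of what value is assigned, so the None case and the 100 case behave the same way.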

Using a simple data frame to illustrate the problem:
df <- data.frame(x=c(1,2,3), y1=c(1,2,3), y2=c(3,4,5))
Plotting a single time series is easy:
hPlot(y="y1", x="x", data=df)
I can't figure out how to plot y1 and y2 together. I tried this, but it returns an error:
> hPlot(x='x', y=c('y1','y2'), data=df)
Error in .subset2(x, i, exact = exact) : subscript out of bounds
Looking at the code in hPlot, it uses [[ to extract a single column from the input data.frame. Does that mean it only works for a single time series?
hPlot <- highchartPlot <- function(..., radius = 3, title = NULL, subtitle = NULL, group.na = NULL){
rChart <- Highcharts$new()
# Get layers
d <- getLayer(...)
data <- data.frame(
x = d$data[[d$x]],
y = d$data[[d$y]]
)