I have a shell provisioner where Packer connects to the box as the user vagrant:
{
  "environment_vars": [
    "HOME_DIR=/home/vagrant"
  ],
  "expect_disconnect": true,
  "scripts": [
    "scripts/foo.sh"
  ],
  "type": "shell"
}
The content of the script is:
whoami
sudo su
whoami
and, strangely, the output is still:
whoami
sudo su
whoami
Why can't I switch to the root user? How can I execute statements as root? Note that I do not want to prefix every single statement like sudo "statement | foo"; instead I want to switch the user globally, as demonstrated with sudo su.
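One way to get there (a sketch based on Packer's execute_command option; the 'vagrant' password piped into sudo -S is an assumption about the base box, not something stated above) is to not switch users inside the script at all and instead have Packer run the whole script through sudo:

{
  "type": "shell",
  "execute_command": "echo 'vagrant' | {{ .Vars }} sudo -S -E bash -eux '{{ .Path }}'",
  "environment_vars": [
    "HOME_DIR=/home/vagrant"
  ],
  "expect_disconnect": true,
  "scripts": [
    "scripts/foo.sh"
  ]
}

With this, whoami inside scripts/foo.sh already reports root; a bare sudo su in a non-interactive script just opens a child shell, and the remaining lines still run as the original user once it exits.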
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html explains nicely how a pivot can be used in Spark.
In my Python code I use a pandas pivot without aggregation, but then reset the index and join:
countryToMerge = pd.pivot_table(data=dfCountries, index=['A'], columns=['B'])
countryToMerge.index.name = 'ISO'
df.merge(countryToMerge['value'].reset_index(), on='ISO', how='inner')
How does this work in Spark?
I tried to group and join manually, like:
val grouped = countryKPI.groupBy("A").pivot("B")
df.join(grouped, df.col("ISO") === grouped.col("ISO")).show
But this does not work. How does reset_index fit into Spark, i.e. how can this be achieved in a Spark-native way?
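What is missing in the attempt above is the aggregation step; a sketch of what typically works (assuming, as in the snippets here, that countryKPI has columns A, B and value, and that df has an ISO column):

import org.apache.spark.sql.functions.first

// Spark's pivot always needs an aggregate; `first` just keeps the single value per cell
val grouped = countryKPI
  .groupBy("A")
  .pivot("B")
  .agg(first("value"))

// the grouping key stays an ordinary column, so there is no reset_index step in Spark;
// rename it if necessary and join on it directly
val joined = df.join(grouped.withColumnRenamed("A", "ISO"), Seq("ISO"), "inner")
joined.show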
A minimal example of the Python code:
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
dates = pd.DataFrame([(datetime(2016, 1, 1) + timedelta(i)).strftime('%Y-%m-%d') for i in range(10)], columns=["dates"])
isos = pd.DataFrame(["ABC", "POL", "ABC", "POL","ABC", "POL","ABC", "POL","ABC", "POL"], columns=['ISO'])
dates['ISO'] = isos.ISO
dates['ISO'] = dates['ISO'].astype("category")
countryKPI = pd.DataFrame({'country_id3':['ABC','POL','ABC','POL'],
'indicator_id':['a','a','b','b'],
'value':[7,8,9,7]})
countryToMerge = pd.pivot_table(data=countryKPI, index=['country_id3'], columns=['indicator_id'])
countryToMerge.index.name = 'ISO' …
Similar to "How to pass parameters to only one part of a pipeline object in scikit-learn?", I want to pass parameters to only part of a pipeline. Usually, this should work fine, like:
estimator = XGBClassifier()
pipeline = Pipeline([
('clf', estimator)
])
and call it like:
pipeline.fit(X_train, y_train, clf__early_stopping_rounds=20)
But it fails with:
/usr/local/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
114 """
115 Xt, yt, fit_params = self._pre_transform(X, y, **fit_params)
--> 116 self.steps[-1][-1].fit(Xt, yt, **fit_params)
117 return self
118
/usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/sklearn.py in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, verbose)
443 early_stopping_rounds=early_stopping_rounds,
444 evals_result=evals_result, obj=obj, feval=feval,
--> 445 verbose_eval=verbose)
446
447 self.objective = xgb_options["objective"]
/usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, learning_rates, xgb_model, …
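One common cause of this failure is that xgboost's early stopping needs an evaluation set to monitor; a sketch of passing both settings through the pipeline's step prefix, using the xgboost 0.6-era scikit-learn API shown in the traceback (X_val / y_val are a hypothetical hold-out split, not data from the original post):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# toy data just to make the sketch self-contained
X = np.random.rand(200, 5)
y = np.random.randint(0, 2, 200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25)

pipeline = Pipeline([('clf', XGBClassifier())])

# fit params prefixed with "clf__" are forwarded to the XGBClassifier step;
# early stopping additionally needs an eval_set to evaluate against
pipeline.fit(
    X_train, y_train,
    clf__early_stopping_rounds=20,
    clf__eval_set=[(X_val, y_val)],
)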
Why is nullable = true after some functions are executed, even though there still are no nan values in the df?
val myDf = Seq((2,"A"),(2,"B"),(1,"C"))
.toDF("foo","bar")
.withColumn("foo", 'foo.cast("Int"))
myDf.withColumn("foo_2", when($"foo" === 2 , 1).otherwise(0)).select("foo", "foo_2").show
When the schema is printed at this point, nullable is false for both columns.
val fooMap = Map(
  1 -> "small",
  2 -> "big"
)

val foo: (Int => String) = (t: Int) => {
  fooMap.get(t) match {
    case Some(tt) => tt
    case None => "notFound"
  }
}
val fooUDF = udf(foo)
myDf
.withColumn("foo", fooUDF(col("foo")))
.withColumn("foo_2", when($"foo" === 2 , 1).otherwise(0)).select("foo", "foo_2")
.select("foo", "foo_2")
.printSchema
But now, nullable is true for at least one column that was false before. How can this be explained?
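A quick way to see where the change happens (a small sketch reusing the myDf and fooUDF defined above) is to inspect the StructField of the column directly; a column produced by a UDF with a reference-type result (String here) is marked nullable because Spark cannot prove the function never returns null:

import org.apache.spark.sql.functions.col

// the original Int column is non-nullable
myDf.schema("foo").nullable            // false

// after the String-returning UDF the same column is marked nullable
val transformed = myDf.withColumn("foo", fooUDF(col("foo")))
transformed.schema("foo").nullable     // true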
I have a program that repeatedly loops over a pandas data frame, as follows:
months = [some months]
for month in months:
    df = original_df[original_df.month == month].copy()
    result = some_function(df)
    print(result)
However, the memory required keeps increasing with each iteration:
types | # objects | total size
================================================ | =========== | ============
<class 'pandas.core.frame.DataFrame | 22 | 6.54 GB
<class 'pandas.core.series.Series | 1198 | 4.72 GB
<class 'numpy.ndarray | 1707 | 648.19 MB
<class 'pandas.core.categorical.Categorical | 238 | 368.90 MB
<class 'pandas.core.indexes.base.Index | 256 | 312.03 MB
================================================ | =========== | ============
<class 'pandas.core.frame.DataFrame | 30 | 9.04 GB
<class 'pandas.core.series.Series …
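One thing worth trying (a sketch that reuses the loop above, not something from the original code) is to drop the per-iteration objects explicitly and force a garbage-collection pass at the end of each loop, so frames from previous months cannot pile up while still referenced:

import gc

for month in months:
    df = original_df[original_df.month == month].copy()
    result = some_function(df)
    print(result)
    # release this iteration's references and collect immediately,
    # instead of waiting for the next rebinding of df/result
    del df, result
    gc.collect()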
I followed the Play documentation describing how to use actors: https://www.playframework.com/documentation/2.4.x/ScalaAkka There they suggest the following:
@Singleton
class Application @Inject() (system: ActorSystem) extends Controller {
val helloActor = system.actorOf(HelloActor.props, "hello-actor")
//...
}
But this leads to:
play.sbt.PlayExceptions$CompilationException: Compilation error[trait Singleton is abstract; cannot be instantiated]
at play.sbt.PlayExceptions$CompilationException$.apply(PlayExceptions.scala:27) ~[na:na]
at play.sbt.PlayExceptions$CompilationException$.apply(PlayExceptions.scala:27) ~[na:na]
at scala.Option.map(Option.scala:145) ~[scala-library-2.11.6.jar:na]
at play.sbt.run.PlayReload$$anonfun$taskFailureHandler$1.apply(PlayReload.scala:49) ~[na:na]
at play.sbt.run.PlayReload$$anonfun$taskFailureHandler$1.apply(PlayReload.scala:44) ~[na:na]
at scala.Option.map(Option.scala:145) ~[scala-library-2.11.6.jar:na]
at play.sbt.run.PlayReload$.taskFailureHandler(PlayReload.scala:44) ~[na:na]
at play.sbt.run.PlayReload$.compileFailure(PlayReload.scala:40) ~[na:na]
at play.sbt.run.PlayReload$$anonfun$compile$1.apply(PlayReload.scala:17) ~[na:na]
at play.sbt.run.PlayReload$$anonfun$compile$1.apply(PlayReload.scala:17) ~[na:na]
To see what I did / to follow along: https://github.com/dataplayground/playground
Moving the @Singleton leads to:
could not find implicit value for parameter timeout: akka.util.Timeout
Here is the code:
implicit …
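The first compiler error usually means the scala.Singleton trait is being resolved instead of the JSR-330 annotation, and the second one points at a missing ask timeout; a sketch of what the controller typically needs (HelloActor is the actor from the Play documentation example above, and the 5-second timeout is an arbitrary choice):

import javax.inject.{Inject, Singleton}

import scala.concurrent.duration._

import akka.actor.ActorSystem
import akka.util.Timeout
import play.api.mvc._

@Singleton
class Application @Inject() (system: ActorSystem) extends Controller {
  // javax.inject.Singleton, not scala.Singleton, is what Play's DI expects
  val helloActor = system.actorOf(HelloActor.props, "hello-actor")

  // required implicitly whenever the actor is queried with the ask pattern (?)
  implicit val timeout: Timeout = 5.seconds

  //...
}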
How can I get the top n per group (say the top 10 or the top 3) in spark-sql?
http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ provides a general SQL tutorial. However, Spark does not implement subqueries in the where clause.
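A window function avoids the correlated subquery entirely; a sketch (the DataFrame df and the column names group_col and value are placeholders, not from the original post):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// rank rows inside each group by descending value, then keep the top 3 per group
val w = Window.partitionBy("group_col").orderBy(desc("value"))

val top3 = df
  .withColumn("rank", row_number().over(w))
  .where(col("rank") <= 3)
  .drop("rank")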
How can a single column in Spark 2.0.1 be converted to an array?
+---+-----+
| id| dist|
+---+-----+
|1.0|2.0|
|2.0|4.0|
|3.0|6.0|
|4.0|8.0|
+---+-----+
should return Array(1.0, 2.0, 3.0, 4.0).

The following attempt
import scala.collection.JavaConverters._
df.select("id").collectAsList.asScala.toArray
fails with
java.lang.RuntimeException: Unsupported array type: [Lorg.apache.spark.sql.Row;
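What typically works instead (a sketch, assuming a SparkSession named spark for the second variant) is to pull the value out of each Row, or to switch to a typed Dataset before collecting:

// extract the Double from every Row after collecting
val asArray: Array[Double] = df.select("id").collect().map(_.getDouble(0))

// or let the Dataset API do it: collect() then already returns Doubles
import spark.implicits._
val asArray2: Array[Double] = df.select("id").as[Double].collect()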
Similar to "Spark - Group by Key then Count by Value", which would allow me to emulate pandas' df.series.value_counts() functionality in Spark:
The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default. (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)

I am curious whether this can be done better / more simply with DataFrames in Spark.
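A sketch of the DataFrame-only equivalent (the column name series is a placeholder): group by the value, count, and sort descending, which mirrors what value_counts returns:

import org.apache.spark.sql.functions.desc

val counts = df
  .na.drop(Seq("series"))        // value_counts excludes NA by default
  .groupBy("series")
  .count()
  .orderBy(desc("count"))        // most frequent value first
counts.show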
Recently I stumbled upon http://dask.pydata.org/en/latest/ Since I have some pandas code that only runs on a single core, I wonder how to make use of my other CPU cores. Does dask work well for using all the (local) CPU cores? And if so, how compatible is it with pandas?

Can I use multiple CPUs with pandas? So far I have read about releasing the GIL, but that all seems rather complicated.
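A minimal sketch of what dask looks like locally (the column names and partition count are arbitrary here, and a reasonably recent dask is assumed): the dask.dataframe API mirrors a large subset of pandas, work is split into partitions, and .compute() runs them across the local cores:

import numpy as np
import pandas as pd
import dask.dataframe as dd

# toy pandas frame standing in for the single-core workload
pdf = pd.DataFrame({
    "key": np.random.choice(["a", "b", "c"], size=1_000_000),
    "value": np.random.rand(1_000_000),
})

# split into partitions that dask can process in parallel
ddf = dd.from_pandas(pdf, npartitions=8)

# familiar pandas-style API, evaluated lazily; the process-based scheduler
# sidesteps the GIL for CPU-bound pandas work
result = ddf.groupby("key")["value"].mean().compute(scheduler="processes")
print(result)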