I am copying the pyspark.ml example from the official documentation site: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Transformer
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
However, the example above does not run and gives me the following error:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-28-aaffcd1239c9> in <module>()
1 from pyspark import *
2 data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
----> 3 df = spark.createDataFrame(data, ["features"])
4 kmeans = KMeans(k=2, seed=1)
5 model = kmeans.fit(df)
NameError: name 'spark' is not defined
What additional configuration/variables need to be set to get the example running?
machine-learning distributed-computing apache-spark pyspark apache-spark-ml
I have the following code:
val data = input.map{... }.persist(StorageLevel.MEMORY_ONLY_SER).repartition(2000)
I am wondering what difference it makes if I do the repartition first:
val data = input.map{... }.repartition(2000).persist(StorageLevel.MEMORY_ONLY_SER)
Is there a difference in the order of the repartition and persist calls? Thanks!
I am using the following to save a numpy array x with a header:
np.savetxt("foo.csv", x, delimiter=",", header="ID,AMOUNT", fmt="%i")
However, if I open "foo.csv", the file looks like this:
# ID,AMOUNT
21,100
52,120
63,29
:
There is an extra `#` character at the start of the header. Why is that, and is there a way to get rid of it?
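For reference, the `#` comes from `np.savetxt`'s `comments` parameter, which prefixes the header with `"# "` by default (so that `np.loadtxt` will skip it); passing `comments=""` removes the prefix. A minimal sketch with made-up data:

```python
import numpy as np

x = np.array([[21, 100], [52, 120], [63, 29]])
# comments="" suppresses the default "# " prefix on the header line.
np.savetxt("foo.csv", x, delimiter=",", header="ID,AMOUNT",
           fmt="%i", comments="")

with open("foo.csv") as f:
    first_line = f.readline().strip()
```

Note that a file saved this way is no longer directly loadable with a bare `np.loadtxt("foo.csv")`, since the header is no longer marked as a comment (use `skiprows=1`).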
I have the following two data frames:
DF1:
Id | field_A | field_B | field_C | field_D
1 | cat | 12 | black | 11
2 | dog | 128 | white | 19
3 | dog | 35 | yellow | 20
4 | dog | 21 | brown | 4
5 | bird | 10 | blue | 7
6 | cow | 99 | brown | 34
and
DF2:
Id | field_B | field_C | field_D | field_E
3 | 35 | …
I am trying to filter an RDD as follows:
spark_df = sc.createDataFrame(pandas_df)
spark_df.filter(lambda r: str(r['target']).startswith('good'))
spark_df.take(5)
But I got the following error:
TypeErrorTraceback (most recent call last)
<ipython-input-8-86cfb363dd8b> in <module>()
1 spark_df = sc.createDataFrame(pandas_df)
----> 2 spark_df.filter(lambda r: str(r['target']).startswith('good'))
3 spark_df.take(5)
/usr/local/spark-latest/python/pyspark/sql/dataframe.py in filter(self, condition)
904 jdf = self._jdf.filter(condition._jc)
905 else:
--> 906 raise TypeError("condition should be string or Column")
907 return DataFrame(jdf, self.sql_ctx)
908
TypeError: condition should be string or Column
Any idea what I am missing? Thanks!
Is it possible to save a pandas DataFrame directly to a parquet file? If not, what is the suggested process?
The purpose is to be able to send the parquet file to another team, who can use Scala code to read/open it. Thanks!
Is it possible to mock an RDD without using a SparkContext?
I would like to unit test the following utility function:
def myUtilityFunction(data1: org.apache.spark.rdd.RDD[myClass1], data2: org.apache.spark.rdd.RDD[myClass2]): org.apache.spark.rdd.RDD[myClass1] = {...}
So I need to pass data1 and data2 to myUtilityFunction. How can I create data1 from a mocked org.apache.spark.rdd.RDD[myClass1], instead of creating a real RDD from a SparkContext? Thanks!
I am running a job on AWS EMR 4.1 with Spark 1.5, using the following conf:
spark-submit --deploy-mode cluster --master yarn-cluster --driver-memory 200g --driver-cores 30 --executor-memory 70g --executor-cores 8 --num-executors 90 --conf spark.storage.memoryFraction=0.45 --conf spark.shuffle.memoryFraction=0.75 --conf spark.task.maxFailures=1 --conf spark.network.timeout=1800s
Then I got the error below. Where can I find out what "Exit status: -100" means, and how can I solve this problem? Thanks!
15/12/05 05:54:24 INFO TaskSetManager: Finished task 176.0 in stage 957.0 (TID 128408) in 130885 ms on ip-10-155-195-239.ec2.internal (106/800)
15/12/05 05:54:24 INFO YarnAllocator: Completed container container_1449241952863_0004_01_000026 (state: COMPLETE, exit status: -100)
15/12/05 05:54:24 INFO YarnAllocator: Container marked as failed: container_1449241952863_0004_01_000026. Exit status: -100. Diagnostics: Container released on a *lost* node
15/12/05 05:54:24 …
I am trying to understand the difference between placeholders and variables in TensorFlow:
X = tf.placeholder("float")
W = tf.Variable(rng.randn(), name="weight")
I have also read the Stack Overflow question below, and I understand the difference when they are inputs to a model.
However, in general, if we are not building a model, is there still a difference between tf.placeholder() and tf.Variable()?
I have the following explode query, which works fine:
data1 = sqlContext.sql("select explode(names) as name from data")
I want to explode another field, "colors", so the final output would be the Cartesian product of names and colors. So I did:
data1 = sqlContext.sql("select explode(names) as name, explode(colors) as color from data")
But I got this error:
Only one generator allowed per select but Generate and and Explode found.;
Does anyone have an idea?
I can actually make it work in two steps:
data1 = sqlContext.sql("select explode(names) as name from data")
data1.registerTempTable('data1')
data1 = sqlContext.sql("select explode(colors) as color from data1")
But I am wondering whether it is possible to do it in one step? Thanks a lot!
apache-spark ×7
python ×4
pyspark ×3
python-3.x ×2
dataframe ×1
emr ×1
hadoop-yarn ×1
hdfs ×1
header ×1
mocking ×1
numpy ×1
parquet ×1
partition ×1
persist ×1
rdd ×1
scala ×1
scalatest ×1
tensorflow ×1
unit-testing ×1