I have the following explode query, which works fine:
data1 = sqlContext.sql("select explode(names) as name from data")
I'd like to explode another field, "colors", so that the final output is the Cartesian product of names and colors. So I did:
data1 = sqlContext.sql("select explode(names) as name, explode(colors) as color from data")
But I got the error:
Only one generator allowed per select but Generate and Explode found.;
Does anyone have an idea?
I can actually make it work in two steps:
data1 = sqlContext.sql("select explode(names) as name from data")
data1.registerTempTable('data1')
data1 = sqlContext.sql("select explode(colors) as color from data1")
But I'm wondering whether it can be done in one step? Thanks a lot!
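For reference, a minimal sketch of a one-statement version, assuming Spark SQL's HiveQL-style LATERAL VIEW syntax is available in this setup (table and column names are the same as above):

data1 = sqlContext.sql("""
    select name, color
    from data
    lateral view explode(names) n as name
    lateral view explode(colors) c as color
""")

Each LATERAL VIEW contributes its own generator, so the one-generator-per-SELECT restriction no longer applies, and the result is still the Cartesian product of the two exploded columns.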
I'm building a deep learning LSTM model with Keras, using the TensorFlow backend. Every time I run the model the results are different. Is there a way to fix the seeds so the results are reproducible? Thanks!
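A minimal sketch of the usual seed-fixing steps for Keras on the TensorFlow backend (the seed value 42 is arbitrary, the calls must run before the model is built, and even with all of them some GPU operations can remain non-deterministic):

import os
import random
import numpy as np
import tensorflow as tf

os.environ['PYTHONHASHSEED'] = '0'   # fix Python's hash randomization
random.seed(42)                      # Python's own RNG
np.random.seed(42)                   # NumPy RNG used for Keras weight initialization
tf.set_random_seed(42)               # TensorFlow 1.x graph-level seed
# (in TensorFlow 2.x the equivalent call is tf.random.set_seed(42))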
I'm trying to convert a DataFrame to an RDD and then perform some operations like the following to return tuples:
df.rdd.map { t=>
(t._2 + "_" + t._3 , t)
}.take(5)
Then I got the error below. Does anyone have an idea? Thanks!
<console>:37: error: value _2 is not a member of org.apache.spark.sql.Row
(t._2 + "_" + t._3 , t)
^
I'm running some PySpark programs on an AWS cluster, and I'm monitoring the jobs through the Spark UI (see attachment). However, I noticed that, unlike Scala or Java Spark programs, where the UI shows which line of code each stage corresponds to, I can't find which stage corresponds to which line of my PySpark code.
Is there a way to find out which stage corresponds to which line of the PySpark code?
Thanks!
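A minimal sketch of one workaround, assuming the jobs can be labelled from the driver with SparkContext.setJobGroup so that the description appears next to the corresponding jobs and stages in the UI (the group names and the S3 path below are hypothetical):

sc.setJobGroup("load-step", "read and filter the raw events")
df = sqlContext.read.parquet("s3://my-bucket/events")   # hypothetical path
ok = df.filter(df['status'] == 'ok')
ok.count()   # the job triggered here carries the "load-step" description in the UI

sc.setJobGroup("agg-step", "aggregate by user")
counts = ok.groupBy('user').count()
counts.collect()   # this job shows up under "agg-step"

This doesn't recover exact line numbers, but it makes it much easier to match stages back to sections of the PySpark code.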
I have the following DataFrame my_df:
team member
--------------------
A Mary
B John
C Amy
A Dan
B Dave
D Paul
B Alex
A Mary
D Mary
I would like the new output DataFrame new_df to be:
team members number
--------------------------------------
A [Mary,Dan] 2
B [John,Dave,Alex] 3
C [Amy] 1
D [Paul,Mary] 2
I'm wondering whether there is existing pandas functionality that can do the task above? Thanks!
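A minimal sketch of one way to do it with groupby, assuming duplicates such as the repeated (A, Mary) row should only be counted once:

new_df = (my_df.drop_duplicates()
               .groupby('team')['member']
               .apply(list)
               .reset_index(name='members'))
new_df['number'] = new_df['members'].str.len()

Here .str.len() on a column of lists returns the length of each list, which gives the per-team member count.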
I'm trying to use sklearn's MinMaxScaler to rescale a Python column as follows:
scaler = MinMaxScaler()
y = scaler.fit(df['total_amount'])
But I got the following error:
Traceback (most recent call last):
File "/Users/edamame/workspace/git/my-analysis/experiments/my_seq.py", line 54, in <module>
y = scaler.fit(df['total_amount'])
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 308, in fit
return self.partial_fit(X, y)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/utils/validation.py", line 441, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[3.180000e+00 2.937450e+03 6.023850e+03 2.216292e+04 1.074589e+04
:
0.000000e+00 0.000000e+00 9.000000e+01 1.260000e+03].
Reshape your data either using array.reshape(-1, …
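A minimal sketch of the fix the traceback suggests: MinMaxScaler expects a 2D array, so pass a one-column DataFrame (or reshape the underlying values) instead of a Series:

scaler = MinMaxScaler()
y = scaler.fit_transform(df[['total_amount']])               # shape (n, 1)
# equivalently:
# y = scaler.fit_transform(df['total_amount'].values.reshape(-1, 1))

Note that fit only fits the scaler and returns the scaler object itself; fit_transform also returns the rescaled values.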
I'm running a Spark job and the tasks don't seem to be well distributed (see attachment). Is there a way to get the tasks distributed more evenly? Thanks!
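A minimal sketch of one common approach, assuming the imbalance comes from too few or uneven partitions; the partition count of 200 is a placeholder to tune (a few times the total number of executor cores is a common starting point):

df = df.repartition(200)      # reshuffle rows evenly across 200 partitions
# for an RDD the call is the same:
# rdd = rdd.repartition(200)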
I launched a Spark cluster on AWS with one master node and 60 core nodes:
Here is the launch command; each core node basically gets 2 executors, for a total of 120 executors:
spark-submit --deploy-mode cluster --master yarn-cluster --driver-memory 180g --driver-cores 26 --executor-memory 90g --executor-cores 13 --num-executors 120
However, the job tracker shows only 119 executors:
I thought there should be 1 driver + 120 executors. Instead, what I see is 119 executors, consisting of 1 driver + 118 worker executors.
Does this mean my master node is not being used? Does the driver run on the master node or on a core node? Can I have the driver run on the master node and let the 60 core nodes host 120 worker executors?
Thanks!
I have the following DataFrame my_df:
col_A col_B
---------------
John []
Mary ['A','B','C']
Ann ['B','C']
I want to remove the rows where col_B holds an empty list, i.e. I would like the new DataFrame to be:
col_A col_B
---------------
Mary ['A','B','C']
Ann ['B','C']
Here is what I did:
my_df[ len(my_df['col_B']) >0 ]
But I got the following error:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.4/dist-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
2133 try:
-> 2134 return self._engine.get_loc(key)
2135 except KeyError:
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4164)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)()
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13166)()
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13120)()
KeyError: True
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent …
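A minimal sketch of a fix: len(my_df['col_B']) is the length of the whole Series (so the boolean indexer collapses to a single True, hence the KeyError), whereas the length has to be computed per row:

new_df = my_df[my_df['col_B'].map(len) > 0]
# equivalently:
# new_df = my_df[my_df['col_B'].str.len() > 0]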
I'm building a neural network model with Keras:
model_keras = Sequential()
model_keras.add(Dense(4, input_dim=input_num, activation='relu',kernel_regularizer=regularizers.l2(0.01)))
model_keras.add(Dense(1, activation='linear',kernel_regularizer=regularizers.l2(0.01)))
sgd = optimizers.SGD(lr=0.01, clipnorm=0.5)
model_keras.compile(loss='mean_squared_error', optimizer=sgd)
model_keras.fit(X_norm_train, y_norm_train, batch_size=20, epochs=100)
The output looks like the following. I'm wondering whether it's possible to print the loss only every 10 epochs, say, rather than every epoch? Thanks!
Epoch 1/200
20/20 [==============================] - 0s - loss: 0.2661
Epoch 2/200
20/20 [==============================] - 0s - loss: 0.2625
Epoch 3/200
20/20 [==============================] - 0s - loss: 0.2590
Epoch 4/200
20/20 [==============================] - 0s - loss: 0.2556
Epoch 5/200
20/20 [==============================] - 0s - loss: 0.2523
Epoch 6/200
20/20 [==============================] - 0s - loss: 0.2490
Epoch 7/200 …
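A minimal sketch of one way to do this, assuming the built-in per-epoch logging is silenced with verbose=0 and replaced by a LambdaCallback that prints only every 10th epoch:

from keras.callbacks import LambdaCallback

print_every_10 = LambdaCallback(
    on_epoch_end=lambda epoch, logs:
        print("Epoch %d: loss = %.4f" % (epoch + 1, logs['loss']))
        if (epoch + 1) % 10 == 0 else None)

model_keras.fit(X_norm_train, y_norm_train, batch_size=20, epochs=100,
                verbose=0, callbacks=[print_every_10])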