我正在尝试使用两列连接两个pandas数据帧:
new_df = pd.merge(A_df, B_df, how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')
Run Code Online (Sandbox Code Playgroud)
但得到以下错误:
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4164)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)()
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13166)()
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13120)()
KeyError: '[B_1, c2]'
Run Code Online (Sandbox Code Playgroud)
知道应该采取什么样的正确方法吗?谢谢!
我有一个pandas数据框my_df,在那里我可以找到给定列的mean(),median(),mode():
my_df['field_A'].mean()
my_df['field_A'].median()
my_df['field_A'].mode()
Run Code Online (Sandbox Code Playgroud)
我想知道是否有可能找到更详细的统计数据,如90%?谢谢!
我使用以下代码绘制条形图:
import matplotlib.pyplot as pls
my_df.plot(x='my_timestampe', y='col_A', kind='bar')
plt.show()
Run Code Online (Sandbox Code Playgroud)
情节很好.但是,我希望通过在列表中包含3列:'col_A','col_B'和'col_C'来改进图形.如下图所示:
我希望col_A
在x轴上方显示为蓝色,col_B
在x轴下方显示为红色,在x轴上方显示col_C
为绿色.这是matplotlib中的可能吗?如何更改以绘制所有三列?谢谢!
我想使用下面的代码转换spark数据框:
from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")
Run Code Online (Sandbox Code Playgroud)
详细的错误消息是:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-11-a19a1763d3ac> in <module>()
1 from pyspark.mllib.clustering import KMeans
2 spark_df = sqlContext.createDataFrame(pandas_df)
----> 3 rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
4 model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")
/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
842 if name not in self.columns:
843 raise AttributeError(
--> 844 "'%s' object …
Run Code Online (Sandbox Code Playgroud) python apache-spark pyspark spark-dataframe apache-spark-mllib
我正在学习TensorFlow,构建一个多层感知器模型.我正在研究一些例子:https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/multilayer_perceptron.ipynb
然后,我在下面的代码中有一些问题:
def multilayer_perceptron(x, weights, biases):
:
:
pred = multilayer_perceptron(x, weights, biases)
:
:
with tf.Session() as sess:
sess.run(init)
:
correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print ("Accuracy:", accuracy.eval({x: X_test, y: y_test_onehot}))
Run Code Online (Sandbox Code Playgroud)
我想知道做什么tf.argmax(prod,1)
和tf.argmax(y,1)
意味着什么,返回(类型和价值)究竟是什么?并且是correct_prediction
变量而不是实际值?
最后,我们如何从tf会话中获取y_test_prediction
数组(输入数据时的预测结果X_test
)?非常感谢!
我使用以下代码从列表中创建数据框:
test_list = ['a','b','c','d']
df_test = pd.DataFrame.from_records(test_list, columns=['my_letters'])
df_test
Run Code Online (Sandbox Code Playgroud)
上面的代码工作正常.然后我尝试了另一个列表的相同方法:
import pandas as pd
q_list = ['112354401', '116115526', '114909312', '122425491', '131957025', '111373473']
df1 = pd.DataFrame.from_records(q_list, columns=['q_data'])
df1
Run Code Online (Sandbox Code Playgroud)
但这次它给了我以下错误:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-24-99e7b8e32a52> in <module>()
1 import pandas as pd
2 q_list = ['112354401', '116115526', '114909312', '122425491', '131957025', '111373473']
----> 3 df1 = pd.DataFrame.from_records(q_list, columns=['q_data'])
4 df1
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1021 else:
1022 arrays, arr_columns = _to_arrays(data, columns,
-> 1023 coerce_float=coerce_float) …
Run Code Online (Sandbox Code Playgroud) 我正在使用pyspark来阅读下面的镶木地板文件:
my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')
Run Code Online (Sandbox Code Playgroud)
然后,当我这样做时my_df.take(5)
,它将显示[Row(...)]
,而不是像我们使用pandas数据帧时的表格格式.
是否可以以pandas数据帧等表格格式显示数据帧?谢谢!
我有以下火花工作,试图将一切都记在内存中:
val myOutRDD = myInRDD.flatMap { fp =>
val tuple2List: ListBuffer[(String, myClass)] = ListBuffer()
:
tuple2List
}.persist(StorageLevel.MEMORY_ONLY).reduceByKey { (p1, p2) =>
myMergeFunction(p1,p2)
}.persist(StorageLevel.MEMORY_ONLY)
Run Code Online (Sandbox Code Playgroud)
但是,当我查看作业跟踪器时,我仍然有很多Shuffle Write和Shuffle溢出到磁盘......
Total task time across all tasks: 49.1 h
Input Size / Records: 21.6 GB / 102123058
Shuffle write: 532.9 GB / 182440290
Shuffle spill (memory): 370.7 GB
Shuffle spill (disk): 15.4 GB
Run Code Online (Sandbox Code Playgroud)
然后这个工作失败了因为"no space left on device"
......我想知道532.9 GB Shuffle写在这里,是写入磁盘还是内存?
另外,为什么还有15.4 G数据溢出到磁盘,而我特意要求将它们保存在内存中?
谢谢!
在pandas数据框中,我使用以下代码绘制列的直方图:
my_df.hist(column = 'field_1')
Run Code Online (Sandbox Code Playgroud)
在pyspark数据框架中是否有可以实现相同目标的东西?(我在Jupyter笔记本中)谢谢!
我有以下3个数组:
int[] indexes = new int[]{0,2,8,5};
String[] sources = new String[]{"how", "are", "today", "you"};
String[] targets = new String[]{"I", "am", "thanks", "fine"};
Run Code Online (Sandbox Code Playgroud)
我想根据索引对三个数组进行排序:
indexes -> {0,2,5,8}
sources -> {"how", "are", "you", "today"}
targets -> {"I", "am", "fine", "thanks"}
Run Code Online (Sandbox Code Playgroud)
我可以创建一个myClass
包含所有三个元素的新类:
class myClass {
int x;
String source;
String target;
}
Run Code Online (Sandbox Code Playgroud)
将所有内容重新分配给myClass,然后myClass
使用排序x
.但是,这需要额外的空间.我想知道是否可以进行in place
排序?谢谢!
python ×6
pandas ×5
pyspark ×3
python-3.x ×3
apache-spark ×2
arrays ×1
bar-chart ×1
dataframe ×1
in-place ×1
java ×1
join ×1
list ×1
matplotlib ×1
persist ×1
python-2.7 ×1
rdd ×1
shuffle ×1
sorting ×1
statistics ×1
tensorflow ×1