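I would like to use pandas as the main Trace object (a series of points in parameter space from MCMC).
I have a list of dicts mapping string -> array that I would like to store in pandas. The keys of the dicts are always the same, and for each key the shape of the numpy array is always the same, but the shape may differ between keys and may have a different number of dimensions.
I have been using self.append(dict_list, ignore_index = True)
which seems to work well for 1d values, but for nd > 1 values pandas stores the values as objects, which doesn't allow for nice plotting and other nice things. Any suggestions on how to get better behavior?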
Sample data
from numpy import array
point = {'x': array(-0.47652306228698005),
         'y': array([[-0.41809043],
                     [ 0.48407823]])}
points = 10 * [point]
I would like to be able to do something like
df = DataFrame(points)
or
df = DataFrame()
df.append(points, ignore_index=True)
and have
>> df['x'][1].shape
()
>> df['y'][1].shape
(2,1)
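The behavior described in the question can be reproduced with a short sketch like the one below. It assumes a reasonably recent pandas and numpy, and builds the frame with the `DataFrame(points)` constructor rather than `append`. The column holding 2-D arrays ends up with dtype `object`: each cell keeps its array (and its shape), but pandas no longer treats the column as numeric.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
point = {"x": np.array(-0.47652306228698005),
         "y": np.array([[-0.41809043],
                        [0.48407823]])}
points = 10 * [point]

df = pd.DataFrame(points)

# The 'y' column stores whole arrays, so pandas falls back to object dtype.
print(df["y"].dtype)
# Each cell is still the original (2, 1) array.
print(df["y"][1].shape)
```

Because the cells are opaque Python objects from pandas' point of view, numeric conveniences such as `df.plot()` have nothing to work with for that column.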
ank*_*tis 10
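The relatively new library xray [1] has the Dataset and DataArray structures, which do exactly what you are asking for.
Here is my take on your problem, written as an IPython session: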
>>> import numpy as np
>>> import xray
>>> ## Prepare data:
>>> #
>>> point = {'x': np.array(-0.47652306228698005),
... 'y': np.array([[-0.41809043],
... [ 0.48407823]])}
>>> points = 10 * [point]
>>> ## Convert to Xray DataArrays:
>>> #
>>> list_x = [p['x'] for p in points]
>>> list_y = [p['y'] for p in points]
>>> da_x = xray.DataArray(list_x, [('x', range(len(list_x)))])
>>> da_y = xray.DataArray(list_y, [
... ('x', range(len(list_y))),
... ('y0', range(2)),
... ('y1', [0]),
... ])
These are the two DataArray instances we have built so far:
>>> print(da_x)
<xray.DataArray (x: 10)>
array([-0.47652306, -0.47652306, -0.47652306, -0.47652306, -0.47652306,
-0.47652306, -0.47652306, -0.47652306, -0.47652306, -0.47652306])
Coordinates:
* x (x) int32 0 1 2 3 4 5 6 7 8 9
>>> print(da_y.T) ## Transposed, to save lines.
<xray.DataArray (y1: 1, y0: 2, x: 10)>
array([[[-0.41809043, -0.41809043, -0.41809043, -0.41809043, -0.41809043,
-0.41809043, -0.41809043, -0.41809043, -0.41809043, -0.41809043],
[ 0.48407823, 0.48407823, 0.48407823, 0.48407823, 0.48407823,
0.48407823, 0.48407823, 0.48407823, 0.48407823, 0.48407823]]])
Coordinates:
* x (x) int32 0 1 2 3 4 5 6 7 8 9
* y0 (y0) int32 0 1
* y1 (y1) int32 0
We can now merge these two DataArrays on their common x dimension into a Dataset:
>>> ds = xray.Dataset({'X':da_x, 'Y':da_y})
>>> print(ds)
<xray.Dataset>
Dimensions: (x: 10, y0: 2, y1: 1)
Coordinates:
* x (x) int32 0 1 2 3 4 5 6 7 8 9
* y0 (y0) int32 0 1
* y1 (y1) int32 0
Data variables:
X (x) float64 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 ...
Y (x, y0, y1) float64 -0.4181 0.4841 -0.4181 0.4841 -0.4181 0.4841 -0.4181 ...
We can finally access and aggregate the data the way you wanted:
>>> ds['X'].sum()
<xray.DataArray 'X' ()>
array(-4.765230622869801)
>>> ds['Y'].sum()
<xray.DataArray 'Y' ()>
array(0.659878)
>>> ds['Y'].sum(axis=1)
<xray.DataArray 'Y' (x: 10, y1: 1)>
array([[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878]])
Coordinates:
* x (x) int32 0 1 2 3 4 5 6 7 8 9
* y1 (y1) int32 0
>>> np.all(ds['Y'].sum(axis=1) == ds['Y'].sum(dim='y0'))
True
>>> ds['X'].sum(dim='y0')
Traceback (most recent call last):
ValueError: 'y0' not found in array dimensions ('x',)
[1] A library for handling labeled N-dimensional data, as pandas does for 2D: http://xray.readthedocs.org/en/stable/data-structures.html#dataset
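The xray package has since been renamed xarray. As a minimal sketch of the same idea under the current name, assuming xarray is installed, the helper below stacks each key of the point dicts along a leading sample dimension and builds a Dataset from the result. The function name `trace_to_dataset` and the dimension names `sample`, `y_dim0`, `y_dim1` are hypothetical choices made for this illustration, not part of the answer above.

```python
import numpy as np
import xarray as xr  # assumption: xarray (the renamed xray) is installed

point = {"x": np.array(-0.47652306228698005),
         "y": np.array([[-0.41809043],
                        [0.48407823]])}
points = 10 * [point]

def trace_to_dataset(points):
    """Stack a list of {name: array} dicts into one xarray Dataset."""
    data_vars = {}
    for key in points[0]:
        # Shape becomes (n_samples, *array_shape) for every key.
        stacked = np.stack([p[key] for p in points])
        # One shared 'sample' dimension, plus per-key dimensions.
        dims = ["sample"] + [f"{key}_dim{i}" for i in range(stacked.ndim - 1)]
        data_vars[key] = (dims, stacked)
    return xr.Dataset(data_vars)

ds = trace_to_dataset(points)
print(ds)
# Reduce over a named dimension, as in the session above.
print(ds["y"].sum(dim="y_dim0").shape)
```

Giving each key its own dimension names avoids accidental alignment between variables whose trailing shapes happen to match.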
Combining @Eike's answer with @JohnSalvatier's comment seems pretty Pandasonic:
>>> import numpy as np
>>> import pandas as pd
>>> point = {'x': np.array(-0.47652306228698005),
... 'y': np.array([[-0.41809043],
... [ 0.48407823]])}
>>> points = 10 * [point] # this creates a list of 10 point dicts
>>> df = pd.DataFrame().append(points)
>>> df.x
# 0 -0.476523062287
# ...
# 9 -0.476523062287
# Name: x, dtype: object
>>> df.y
# 0 [[-0.41809043], [0.48407823]]
# ...
# 9 [[-0.41809043], [0.48407823]]
# Name: y, dtype: object
>>> df.y[0]
# array([[-0.41809043],
# [ 0.48407823]])
>>> df.y[0].shape
# (2, 1)
To plot (and do all the other cool 2-D pandas things), you still have to manually convert the column of arrays back into a DataFrame:
>>> dfy = pd.DataFrame([row.T[0] for row in df.y])
>>> dfy += np.matrix([[0] * 10, range(10)]).T
>>> dfy *= np.matrix([range(10), range(10)]).T
>>> dfy.plot()
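The manual conversion above can be generalized to every key at once. The following is a sketch only, assuming all points share the same keys and per-key shapes as stated in the question: it flattens each array into one numeric column per element, so the whole trace becomes an ordinary float DataFrame. The helper name `flatten_trace` and the column naming scheme (`y[0, 0]`, `y[1, 0]`) are hypothetical.

```python
import itertools

import numpy as np
import pandas as pd

point = {"x": np.array(-0.47652306228698005),
         "y": np.array([[-0.41809043],
                        [0.48407823]])}
points = 10 * [point]

def flatten_trace(points):
    """Turn a list of {name: array} dicts into a purely numeric DataFrame."""
    columns = {}
    for key in points[0]:
        stacked = np.stack([np.asarray(p[key]) for p in points])
        # One row per sample, one column per array element (C order).
        flat = stacked.reshape(len(points), -1)
        index_ranges = [range(n) for n in stacked.shape[1:]]
        for j, idx in enumerate(itertools.product(*index_ranges)):
            # Scalars keep the bare key; arrays get the element index appended.
            name = key if not idx else f"{key}{list(idx)}"
            columns[name] = flat[:, j]
    return pd.DataFrame(columns)

flat = flatten_trace(points)
print(flat.dtypes)
```

The trade-off is that the original array shapes are no longer stored in the frame and would have to be kept separately if the arrays need to be rebuilt.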
To store it on disk, use to_pickle:
>>> df.to_pickle('/tmp/sotest.pickle')
>>> df2 = pd.read_pickle('/tmp/sotest.pickle')
>>> df2.y[0].shape
# (2, 1)
If you use to_csv, your np.arrays become strings:
>>> df.to_csv('/tmp/sotest.csv')
>>> df2 = pd.read_csv('/tmp/sotest.csv', index_col=0)
>>> df2.y[0]
# '[[-0.41809043]\n [ 0.48407823]]'
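This goes a bit against the philosophy of Pandas, which seems to see Series as a one-dimensional data structure. Therefore you have to create the Series by hand and tell them that they have data type "object". This means that no automatic data conversions are applied.
You can do it like this (reordered IPython session):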
In [9]: import pandas as pd
In [10]: import numpy
In [11]: from numpy import array
In [1]: point = {'x': array(-0.47652306228698005),
...: 'y': array([[-0.41809043],
...: [ 0.48407823]])}
In [2]: points = 10 * [ point]
In [5]: lx = [p["x"] for p in points]
In [7]: ly = [p["y"] for p in points]
In [40]: sx = pd.Series(lx, dtype=numpy.dtype("object"))
In [38]: sy = pd.Series(ly, dtype=numpy.dtype("object"))
In [43]: df = pd.DataFrame({"x":sx, "y":sy})
In [45]: df['x'][1].shape
Out[45]: ()
In [46]: df['y'][1].shape
Out[46]: (2, 1)
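The same steps can be wrapped in a small helper so that they do not have to be repeated per key. This is a sketch under the assumption that every dict has the same keys; the name `points_to_object_frame` is hypothetical and simply loops over the keys, building one object-dtype Series each, as in the session above.

```python
import numpy as np
import pandas as pd

point = {"x": np.array(-0.47652306228698005),
         "y": np.array([[-0.41809043],
                        [0.48407823]])}
points = 10 * [point]

def points_to_object_frame(points):
    """Build a DataFrame whose cells hold the original arrays unchanged."""
    series = {}
    for key in points[0]:
        values = [p[key] for p in points]
        # dtype=object tells pandas not to attempt any automatic conversion.
        series[key] = pd.Series(values, dtype=object)
    return pd.DataFrame(series)

df = points_to_object_frame(points)
print(df["y"][1].shape)
```

This keeps the per-point arrays intact, at the cost of every column being an object column.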