use*_*545 3 python indexing multi-level pandas
Can someone help me with this task? I have data in a multi-level DataFrame produced by an unstack() operation:
Original df:
Density Length Range Count
15k 0.60 small 555
15k 0.60 big 17
15k 1.80 small 141
15k 1.80 big 21
15k 3.60 small 150
15k 3.60 big 26
20k 0.60 small 5543
20k 0.60 big 22
20k 1.80 small 553
20k 1.80 big 25
20k 3.60 small 422
20k 3.60 big 35
df = df.set_index(['Density','Length','Range']).unstack('Range')
# After unstack:
               Count
Range            big small
Density Length
15k     0.60      17   555
        1.80      21   141
        3.60      26   150
20k     0.60      22  5543
        1.80      25   553
        3.60      35   422
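Now I am trying to add an extra column at column level 1: the ratio small/big. I tried the following statements. None of them raises an error, but the results differ: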
#df[:]['ratio']=df['Count']['small']/df['Count']['big'] ## case 1. no error, no ratio
#df['Count']['ratio']=df['Count']['small']/df['Count']['big'] ## case 2. no error, no ratio
#df['ratio']=df['Count']['small']/df['Count']['big'] ## case 3. no error, ratio on column level 0
df['ratio']=df.ix[:,1]/df.ix[:,0] ## case 4. no error, ratio on column level 0
# After executing the code above, df:
               Count         ratio
Range            big small
Density Length
15k     0.60      17   555   32.65
        1.80      21   141    6.71
        3.60      26   150    5.77
20k     0.60      22  5543  251.95
        1.80      25   553   22.12
        3.60      35   422   12.06
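I don't understand why cases 1 and 2 neither add the new ratio column nor raise an error, or why in cases 3 and 4 the ratio column ends up at level 0 rather than the expected level 1. I'd also like to know whether there is a better or more concise way to do this. Case 4 is the best I could manage, but I don't like referring to columns implicitly by position instead of by name.
Thanks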
Case 1:
df[:]['ratio']=df['Count']['small']/df['Count']['big']
df[:] returns a new DataFrame, not df itself. The two are different objects, so a column added to one does not appear in the other:
In [69]: df[:] is df
Out[69]: False
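So modifying that new object has no effect on the original df. Since no reference to df[:] is kept, the object is garbage-collected after the assignment, which makes the assignment useless.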
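To make case 1 concrete, here is a minimal sketch that rebuilds the question's data and runs the same assignment. The names raw, same_object and has_ratio are illustrative, and the warnings filter is only there because some pandas versions warn about this pattern.

```python
import warnings

import pandas as pd

# Rebuild the question's data (values copied from the question).
raw = pd.DataFrame({
    "Density": ["15k"] * 6 + ["20k"] * 6,
    "Length": [0.6, 0.6, 1.8, 1.8, 3.6, 3.6] * 2,
    "Range": ["small", "big"] * 6,
    "Count": [555, 17, 141, 21, 150, 26, 5543, 22, 553, 25, 422, 35],
})
df = raw.set_index(["Density", "Length", "Range"]).unstack("Range")

with warnings.catch_warnings():
    # Depending on the pandas version, this assignment may emit a warning.
    warnings.simplefilter("ignore")
    # Case 1: the new column lands on the temporary object returned by df[:].
    df[:]["ratio"] = df["Count"]["small"] / df["Count"]["big"]

same_object = df[:] is df
has_ratio = "ratio" in df.columns.get_level_values(0)
print(same_object, has_ratio)
print(df.columns.tolist())
```

The original df still has only its two Count columns afterwards, because the temporary object that received the column was discarded.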
Case 2:
df['Count']['ratio']=df['Count']['small']/df['Count']['big']
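This uses chained indexing. Avoid chained indexing when making assignments. The linked documentation explains why an assignment with chained indexing on the left-hand side may not affect df.
If you set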
pd.options.mode.chained_assignment = 'warn'
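then Pandas warns you when you use chained indexing in an assignment (with 'raise' instead of 'warn', it raises SettingWithCopyError):
SettingWithCopyWarning: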
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
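The effect of case 2 can be checked in the same way. This is a sketch under the assumption that the data matches the question. The names caught and column_added are illustrative, and which warning class (if any) is recorded depends on the pandas version, so nothing is asserted about it here.

```python
import warnings

import pandas as pd

raw = pd.DataFrame({
    "Density": ["15k"] * 6 + ["20k"] * 6,
    "Length": [0.6, 0.6, 1.8, 1.8, 3.6, 3.6] * 2,
    "Range": ["small", "big"] * 6,
    "Count": [555, 17, 141, 21, 150, 26, 5543, 22, 553, 25, 422, 35],
})
df = raw.set_index(["Density", "Length", "Range"]).unstack("Range")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Case 2: df["Count"] builds a new DataFrame (columns big, small);
    # the ratio column is added to that temporary, not to df.
    df["Count"]["ratio"] = df["Count"]["small"] / df["Count"]["big"]

# Names of any warnings pandas emitted (version-dependent).
print([w.category.__name__ for w in caught])

column_added = ("Count", "ratio") in df.columns
print(column_added)
```

The first indexing step hands back a separate object, so the second step's assignment never reaches df.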
Case 3:
df['ratio']=df['Count']['small']/df['Count']['big']
and Case 4:
df['ratio']=df.ix[:,1]/df.ix[:,0]
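both work (note that .ix was removed in pandas 1.0; on current versions the positional equivalent is df.iloc[:, 1] / df.iloc[:, 0]), but the same thing can be done more efficiently with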
df['ratio'] = df['Count','small']/df['Count','big']
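As for why cases 3 and 4 put ratio at level 0: a plain string key such as 'ratio' is taken as a level-0 label, and pandas fills in the remaining level itself. The sketch below, assuming the question's data, shows the resulting column labels. The names level0 and level1 are illustrative.

```python
import pandas as pd

raw = pd.DataFrame({
    "Density": ["15k"] * 6 + ["20k"] * 6,
    "Length": [0.6, 0.6, 1.8, 1.8, 3.6, 3.6] * 2,
    "Range": ["small", "big"] * 6,
    "Count": [555, 17, 141, 21, 150, 26, 5543, 22, 553, 25, 422, 35],
})
df = raw.set_index(["Density", "Length", "Range"]).unstack("Range")

# Tuple keys select one column by its full (level 0, level 1) label.
df["ratio"] = df["Count", "small"] / df["Count", "big"]

# Inspect where the new label ended up.
level0 = df.columns.get_level_values(0).tolist()
level1 = df.columns.get_level_values(1).tolist()
print(level0)
print(level1)
```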
Here is a microbenchmark showing that df[tuple_index] is faster than chained indexing:
In [99]: %timeit df['Count']['small']
1000 loops, best of 3: 501 µs per loop
In [128]: %timeit df['Count','small']
100000 loops, best of 3: 8.91 µs per loop
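Outside IPython, a comparable measurement can be sketched with the standard timeit module. The loop count of 200 is an arbitrary choice for illustration, and absolute timings will vary with the machine and pandas version, so no expected numbers are given.

```python
import timeit

import pandas as pd

raw = pd.DataFrame({
    "Density": ["15k"] * 6 + ["20k"] * 6,
    "Length": [0.6, 0.6, 1.8, 1.8, 3.6, 3.6] * 2,
    "Range": ["small", "big"] * 6,
    "Count": [555, 17, 141, 21, 150, 26, 5543, 22, 553, 25, 422, 35],
})
df = raw.set_index(["Density", "Length", "Range"]).unstack("Range")

# Chained indexing builds an intermediate DataFrame before selecting 'small'.
chained = timeit.timeit(lambda: df["Count"]["small"], number=200)
# A tuple key selects the column in a single lookup.
direct = timeit.timeit(lambda: df["Count", "small"], number=200)

# Both forms return the same values; only the lookup path differs.
same_values = df["Count"]["small"].tolist() == df["Count", "small"].tolist()

print(f"chained: {chained:.4f}s  tuple: {direct:.4f}s  same values: {same_values}")
```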
If you want ratio to be a level-1 label, you have to tell Pandas that the level-0 label is Count. You can do that by assigning to df['Count','ratio']:
In [96]: df['Count','ratio'] = df['Count','small']/df['Count','big']
# In [97]: df
# Out[97]:
#                Count
# Range            big small       ratio
# Density Length
# 15k     0.6       17   555   32.647059
#         1.8       21   141    6.714286
#         3.6       26   150    5.769231
# 20k     0.6       22  5543  251.954545
#         1.8       25   553   22.120000
#         3.6       35   422   12.057143
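Put together as a standalone script, the level-1 assignment might look like the sketch below. It assumes the question's data and refers to every column by name, which is what the question asked for. The names columns and first_ratio are illustrative.

```python
import pandas as pd

raw = pd.DataFrame({
    "Density": ["15k"] * 6 + ["20k"] * 6,
    "Length": [0.6, 0.6, 1.8, 1.8, 3.6, 3.6] * 2,
    "Range": ["small", "big"] * 6,
    "Count": [555, 17, 141, 21, 150, 26, 5543, 22, 553, 25, 422, 35],
})
df = raw.set_index(["Density", "Length", "Range"]).unstack("Range")

# The tuple key names both levels, so ratio is created under Count.
df["Count", "ratio"] = df["Count", "small"] / df["Count", "big"]

columns = df.columns.tolist()
first_ratio = round(df["Count", "ratio"].iloc[0], 2)  # 555 / 17
print(columns)
print(first_ratio)
```

After this, df['Count'] returns big, small and ratio together, since all three share the same level-0 label.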