Pandas: convert a df to a matrix based on a condition

Tyl*_*rNG 4 python numpy dataframe pandas

Is it possible to convert a df into a matrix like the one shown below? Given this df:

Name Value
x    5
x    2
x    3
x    3
y    3
y    2
z    4

The matrix would be:

Name    1    2    3   4   5   
x       4    4    3   1   1
y       2    2    1   0   0
z       1    1    1   1   0

Here is the logic behind it:

Name    1    2    3  4    5   (5 columns since 5 is the max in Value)
--------------------------------------------------------------------
x       4 (since x has 4 values >= 1)     4 (since x has 4 values >= 2)    3 (since x has 3 values >= 3)   1 (since x has 1 value >= 4)    1 (since x has 1 value >= 5)
y       2 (since y has 2 values >= 1)     2 (since y has 2 values >= 2)    1 (since y has 1 value >= 3)    0 (since y has no values >= 4)  0 (since y has no values >= 5)
z       1 (since z has 1 value >= 1)      1 (since z has 1 value >= 2)     1 (since z has 1 value >= 3)    1 (since z has 1 value >= 4)    0 (since z has no values >= 5)

Please let me know if this makes sense.
I know I will have to use sort, group and count, but I can't figure out how to set up the matrix.

Thanks!!!

cs9*_*s95 8

Probably the fastest solution, using numpy broadcasting:

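import numpy as np
import pandas as pd

i = np.arange(1, df.Value.max() + 1)   # thresholds 1..max(Value)
j = df.Value.values[:, None] >= i      # boolean array: one row per df row, one column per threshold

# The original used .sum(level=0), which was removed in pandas 2.0; groupby(level=0).sum() replaces it.
# The output below is from an older pandas; newer versions may print integer counts.
df = pd.DataFrame(j, columns=i, index=df.Name).groupby(level=0).sum()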

        1    2    3    4    5
Name                         
x     4.0  4.0  3.0  1.0  1.0
y     2.0  2.0  1.0  0.0  0.0
z     1.0  1.0  1.0  1.0  0.0
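For anyone who wants to run the approach end to end, here is a minimal self-contained sketch. How df was built is an assumption (the question only shows the printed frame), the names thresholds, mask and result are mine, and groupby(level=0).sum() stands in for the older sum(level=0) spelling.

```python
import numpy as np
import pandas as pd

# Sample data from the question (the constructor call itself is an assumption).
df = pd.DataFrame({
    "Name": ["x", "x", "x", "x", "y", "y", "z"],
    "Value": [5, 2, 3, 3, 3, 2, 4],
})

# Thresholds 1..max(Value); one output column per threshold.
thresholds = np.arange(1, df["Value"].max() + 1)

# Broadcast (n_rows, 1) against (n_thresholds,) -> boolean (n_rows, n_thresholds).
mask = df["Value"].to_numpy()[:, None] >= thresholds

# Count the True entries per Name; astype(int) makes the integer dtype explicit.
result = (
    pd.DataFrame(mask, columns=thresholds, index=df["Name"])
    .groupby(level=0)
    .sum()
    .astype(int)
)
print(result)
```

The final astype(int) is only there so the dtype does not depend on the pandas version.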

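Caveat: this trades memory for speed. The intermediate boolean array j has len(df) * df.Value.max() elements, so on large data it can exhaust memory. Use it with care.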
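If the intermediate array is too large, one possible alternative (not part of the answer above, and only a sketch) is to count exact values per name with pd.crosstab and then take a right-to-left cumulative sum. This assumes Value holds positive integers; the names counts and at_least are mine.

```python
import numpy as np
import pandas as pd

# Sample data from the question (the constructor call itself is an assumption).
df = pd.DataFrame({
    "Name": ["x", "x", "x", "x", "y", "y", "z"],
    "Value": [5, 2, 3, 3, 3, 2, 4],
})

# Count how often each exact Value occurs per Name: shape (n_names, n_distinct_values).
counts = pd.crosstab(df["Name"], df["Value"])

# Add columns for thresholds that never occur as a value (here: 1).
thresholds = np.arange(1, df["Value"].max() + 1)
counts = counts.reindex(columns=thresholds, fill_value=0)

# A right-to-left cumulative sum turns "count of values == v" into "count of values >= v".
at_least = counts.iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1]
print(at_least)
```

The largest object here has one row per distinct name rather than one row per input row, which is where the memory saving would come from.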


Details

Create a range of values from 1 to df.Value.max():

i = np.arange(1, df.Value.max() + 1)
i
array([1, 2, 3, 4, 5])

Perform a broadcasted comparison of df.Value with i:

j = df.Value.values[:, None] >= i
j

array([[ True,  True,  True,  True,  True],
       [ True,  True, False, False, False],
       [ True,  True,  True, False, False],
       [ True,  True,  True, False, False],
       [ True,  True,  True, False, False],
       [ True,  True, False, False, False],
       [ True,  True,  True,  True, False]], dtype=bool)
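To see why the [:, None] is needed, it can help to look at the shapes involved. The sketch below uses a plain NumPy array in place of df.Value.values; the names values, thresholds, column and mask are mine.

```python
import numpy as np

values = np.array([5, 2, 3, 3, 3, 2, 4])     # the Value column from the question
thresholds = np.arange(1, values.max() + 1)  # array([1, 2, 3, 4, 5])

# Indexing with None adds a trailing axis, same as values.reshape(-1, 1).
column = values[:, None]
print(column.shape)      # (7, 1)
print(thresholds.shape)  # (5,)

# NumPy stretches (7, 1) and (5,) to a common (7, 5) shape before comparing,
# so mask[r, c] is True when values[r] >= thresholds[c].
mask = column >= thresholds
print(mask.shape)        # (7, 5)
```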

Load it into a dataframe, then perform a grouped sum on df.Name to get the final result.

k = pd.DataFrame(j, columns=i, index=df.Name)
k
         1     2      3      4      5
Name                                 
x     True  True   True   True   True
x     True  True  False  False  False
x     True  True   True  False  False
x     True  True   True  False  False
y     True  True   True  False  False
y     True  True  False  False  False
z     True  True   True   True  False
# Sum the boolean rows within each Name group (written k.sum(level=0) in older pandas).
k.groupby(level=0).sum()

        1    2    3    4    5
Name                         
x     4.0  4.0  3.0  1.0  1.0
y     2.0  2.0  1.0  0.0  0.0
z     1.0  1.0  1.0  1.0  0.0

Depending on the pandas version, the sums may come back as floats, as above. If you need integers, call .astype(int):

k.groupby(level=0).sum().astype(int)

      1  2  3  4  5
Name               
x     4  4  3  1  1
y     2  2  1  0  0
z     1  1  1  1  0