Minimum value in each column of a DataFrame, excluding zeros

Pol*_*ova 6 python types dataframe pandas

The original DataFrame is a table like this:

                        S1_r1_ctrl/     S1_r2_ctrl/     S1_r3_ctrl/
sp|P38646|GRP75_HUMAN   2.960000e-06    5.680000e-06    0.000000e+00
sp|O75694-2|NU155_HUMAN 2.710000e-07    0.000000e+00    2.180000e-07
sp|Q05397-2|FAK1_HUMAN  0.000000e+00    2.380000e-07    7.330000e-06
sp|O60671-2|RAD1_HUMAN  NaN             NaN             NaN

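I am looking for the smallest value greater than zero in each column of the DataFrame. I tried to solve this using this example. My code looks like this: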

df.ne(0).idxmin().to_frame('pos').assign(value=lambda d: df.lookup(d.pos, d.index))

But I still get only zeros; my result looks like this:

            pos                     value

S1_r1_ctrl/ sp|Q05397-2|FAK1_HUMAN  0.0
S1_r2_ctrl/ sp|O75694-2|NU155_HUMAN 0.0
S1_r3_ctrl/ sp|P38646|GRP75_HUMAN   0.0

Instead of this:

            pos                     value
S1_r1_ctrl/ sp|O75694-2|NU155_HUMAN 2.710000e-07
S1_r2_ctrl/ sp|Q05397-2|FAK1_HUMAN  2.380000e-07
S1_r3_ctrl/ sp|O75694-2|NU155_HUMAN 2.180000e-07

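I suspect there may be a problem with the data types, but I'm not sure. I assumed ne(0) would ignore the zeros, but it doesn't, and I'm confused. Perhaps there is a smarter way to find what I need.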

use*_*203 7

Setup

df = pd.DataFrame([[0, 0, 0],
                   [0, 10, 0],
                   [4, 0, 0],
                   [1, 2, 3]],
                  columns=['first', 'second', 'third'])

Use a mask with min(0):

df[df.gt(0)].min(0)

first     1.0
second    2.0
third     3.0
dtype: float64
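To see why the mask works, it may help to look at the intermediate step. The sketch below reuses the setup frame from this answer; the names `masked`, `col_min`, and `first_zero` are illustrative and not part of the original answer. It also shows what the `ne(0).idxmin()` expression from the question does on this frame, which appears to be why only zeros came back.

```python
import pandas as pd

df = pd.DataFrame([[0, 0, 0],
                   [0, 10, 0],
                   [4, 0, 0],
                   [1, 2, 3]],
                  columns=['first', 'second', 'third'])

# Indexing with a boolean DataFrame keeps the shape and puts NaN
# wherever the mask is False (here: every value that is not > 0).
masked = df[df.gt(0)]

# min() skips NaN by default (skipna=True), so only positive values compete.
col_min = masked.min(0)

# The question's expression: df.ne(0) is a boolean frame, and idxmin()
# returns the label of the first minimum, i.e. the first False.
# That is the first row where the value IS zero.
first_zero = df.ne(0).idxmin()
```

On a boolean frame, False sorts below True, so `idxmin()` lands on a row holding a zero rather than on the smallest non-zero value.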

As @DSM pointed out, this can also be written as:

df.where(df.gt(0)).min(0)
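If the row label is wanted alongside the value, as in the `pos` / `value` table from the question, the same mask can feed both `idxmin()` and `min()`. This is only a sketch: the frame is retyped from the question's sample data, and `masked` and `result` are names made up for the example. It avoids `DataFrame.lookup`, which, as far as I know, was deprecated in pandas 1.2 and removed in 2.0.

```python
import numpy as np
import pandas as pd

# Sample data retyped from the question.
df = pd.DataFrame(
    {
        'S1_r1_ctrl/': [2.96e-06, 2.71e-07, 0.0, np.nan],
        'S1_r2_ctrl/': [5.68e-06, 0.0, 2.38e-07, np.nan],
        'S1_r3_ctrl/': [0.0, 2.18e-07, 7.33e-06, np.nan],
    },
    index=[
        'sp|P38646|GRP75_HUMAN',
        'sp|O75694-2|NU155_HUMAN',
        'sp|Q05397-2|FAK1_HUMAN',
        'sp|O60671-2|RAD1_HUMAN',
    ],
)

# Zeros (and anything else not > 0) become NaN; existing NaN stay NaN.
masked = df.where(df.gt(0))

# idxmin() and min() both skip NaN by default, so each column reports
# the label and value of its smallest positive entry.
result = pd.DataFrame({'pos': masked.idxmin(), 'value': masked.min()})
print(result)
```

Note that `idxmin()` needs at least one positive value per column; a column that is entirely zero or NaN would need separate handling.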

Performance

def chris():
    df1[df1.gt(0)].min(0)

def chris2():
    df1.where(df1.gt(0)).min(0)

def wen():
    a=df1.values.T
    a = np.ma.masked_equal(a, 0.0, copy=False)
    a.min(1)

def haleemur():
    df1.replace(0, np.nan).min()
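Before timing, it seems worth confirming that the four variants agree on the sample frame. The check below is a sketch rather than part of the original benchmark; the `by_*` names are made up for the example, and the NumPy variant uses the default `copy` argument instead of `copy=False`.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[0, 0, 0],
                    [0, 10, 0],
                    [4, 0, 0],
                    [1, 2, 3]],
                   columns=['first', 'second', 'third'])

# chris / chris2: mask everything that is not > 0, then take the column minimum.
by_index = df1[df1.gt(0)].min(0)
by_where = df1.where(df1.gt(0)).min(0)

# haleemur: turn exact zeros into NaN, which min() skips.
by_replace = df1.replace(0, np.nan).min()

# wen: mask exact zeros in the transposed array and reduce along axis 1.
by_masked = np.ma.masked_equal(df1.to_numpy().T, 0).min(1)

print(by_index.tolist(), by_where.tolist(), by_replace.tolist(), by_masked.tolist())
```

The variants only coincide here because the data has no negative values: `gt(0)` drops negatives, while `replace(0, ...)` and `masked_equal(..., 0)` remove exact zeros only.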

Setup

from timeit import timeit

import matplotlib.pyplot as plt
import numpy as np   # used by wen() and haleemur()
import pandas as pd

res = pd.DataFrame(
       index=['chris', 'chris2', 'wen', 'haleemur'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        df1 = df.copy()
        df1 = pd.concat([df1]*c)
        stmt = '{}()'.format(f)
        setp = 'from __main__ import df1, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

# Normalise each column by its fastest time, then plot on log-log axes.
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

plt.show()

Results

[Plot: relative time versus N for each function, log-log axes]


WeN*_*Ben 5

Perhaps NumPy is a good option:

a=df.values.T
a = np.ma.masked_equal(a, 0.0, copy=False)
a.min(1)
Out[755]: 
masked_array(data=[1, 2, 3],
             mask=[False, False, False],
       fill_value=999999,
            dtype=int64)
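The masked-array result drops the column labels. If a labelled Series is preferred, one possible way to convert back is sketched below; `mins` and `as_series` are illustrative names, and the cast to float is an assumption made so that a fully masked column could be filled with NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 0, 0],
                   [0, 10, 0],
                   [4, 0, 0],
                   [1, 2, 3]],
                  columns=['first', 'second', 'third'])

# Mask exact zeros; without the transpose, the reduction runs along axis 0.
a = np.ma.masked_equal(df.to_numpy(), 0)
mins = a.min(axis=0)  # masked array with one entry per column

# Cast to float so masked entries (columns with no non-zero value)
# can be represented as NaN, then reattach the column labels.
as_series = pd.Series(mins.astype(float).filled(np.nan), index=df.columns)
print(as_series)
```

Keep in mind that `masked_equal` hides exact zeros only, so negative values would still take part in the minimum, unlike with the `gt(0)` mask.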