Pandas - case when & default in pandas

Tom*_*thi 7 python pandas

I have the following case statement in Python:

pd_df['difficulty'] = 'Unknown'
pd_df['difficulty'][(pd_df['Time']<30) & (pd_df['Time']>0)] = 'Easy'
pd_df['difficulty'][(pd_df['Time']>=30) & (pd_df['Time']<=60)] = 'Meduim'
pd_df['difficulty'][pd_df['Time']>60] = 'Hard'

But when I run the code, it throws an error.

A value is trying to be set on a copy of a slice from a DataFrame

cs9*_*s95 11

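Option 1
For performance, use nested np.where conditions. For the conditions themselves you can just use pd.Series.between, and the default value is filled in for every row that matches none of them.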

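import numpy as np

# pandas 1.3+ expects inclusive='neither'/'both'; the boolean form is deprecated
pd_df['difficulty'] = np.where(
    pd_df['Time'].between(0, 30, inclusive='neither'),
    'Easy',
    np.where(
        pd_df['Time'].between(30, 60, inclusive='both'),
        'Medium',
        np.where(pd_df['Time'] > 60, 'Hard', 'Unknown')
    )
)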
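To see which boundary values each between call captures, here is a minimal sketch on a few made-up values. The sample data and the names time, easy and medium are illustrative only, and it assumes pandas 1.3 or later, where inclusive takes a string.

```python
import pandas as pd

# Hypothetical sample values sitting on and around the interval edges.
time = pd.Series([0, 1, 29, 30, 45, 60, 61], name="Time")

# Open interval (0, 30): both endpoints are excluded.
easy = time.between(0, 30, inclusive="neither")

# Closed interval [30, 60]: both endpoints are included.
medium = time.between(30, 60, inclusive="both")

# Side-by-side view of the values and the two boolean masks.
print(pd.DataFrame({"Time": time, "easy": easy, "medium": medium}))
```

Because the two intervals do not overlap, 30 is matched only by the second mask, which is what the original `< 30` and `>= 30` comparisons intend.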

Option 2
Similarly, you can use np.select, which leaves more room for adding conditions:

pd_df['difficulty'] = np.select(
    [
        pd_df['Time'].between(0, 30, inclusive='neither'),
        pd_df['Time'].between(30, 60, inclusive='both'),
        pd_df['Time'] > 60
    ],
    [
        'Easy',
        'Medium',
        'Hard'
    ],
    default='Unknown'
)
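One property of np.select worth knowing when adding conditions: if several conditions are true for a row, the first one in the list wins. The sketch below relies on that by ordering overlapping conditions from the highest threshold down. The sample frame and the name conditions are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; -5 and 0 match none of the conditions.
pd_df = pd.DataFrame({"Time": [-5, 0, 10, 30, 60, 90]})

# Overlapping conditions: 90 satisfies all three, 30 and 60 satisfy the last two.
# np.select uses the first condition that is True for each row.
conditions = [
    pd_df["Time"] > 60,
    pd_df["Time"] >= 30,
    pd_df["Time"] > 0,
]
choices = ["Hard", "Medium", "Easy"]

pd_df["difficulty"] = np.select(conditions, choices, default="Unknown")
print(pd_df)
```

Rows that match no condition receive the default, so no separate initialization of the column is needed with this approach.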

Option 3
Another performant solution uses loc:

pd_df['difficulty'] = 'Unknown'
pd_df.loc[pd_df['Time'].between(0, 30, inclusive='neither'), 'difficulty'] = 'Easy'
pd_df.loc[pd_df['Time'].between(30, 60, inclusive='both'), 'difficulty'] = 'Medium'
pd_df.loc[pd_df['Time'] > 60, 'difficulty'] = 'Hard'


cot*_*ail 7

case_when

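Since pandas 2.2.0, you can use case_when() on a column. Just initialize the column with the default value and replace values in it using case_when(), which accepts a list of (condition, replacement) tuples. For the example in the OP, we can use the following.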

pd_df["difficulty"] = "Unknown"
pd_df["difficulty"] = pd_df["difficulty"].case_when([
    (pd_df.eval("0 < Time < 30"), "Easy"),
    (pd_df.eval("30 <= Time <= 60"), "Medium"),
    (pd_df.eval("Time > 60"), "Hard")
])

loc

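OP's code only needed loc to correctly call __setitem__() via []. In particular, they already used proper parentheses () to evaluate the conditions chained with & individually.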

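The basic idea of this approach is to initialize a column with some default value (e.g. "Unknown") and then update rows depending on conditions (e.g. "Easy" if 0<Time<30), and so on.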
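As a small, self-contained illustration of that idea, here is a sketch on made-up data (the sample values and the mask names easy, medium and hard are assumptions for the example). The chained form `pd_df['difficulty'][mask] = value` first fetches the column and then assigns into that intermediate object, which pandas may treat as a copy; passing the row mask and the column label to a single .loc call avoids the intermediate step.

```python
import pandas as pd

# Hypothetical sample data covering every branch.
pd_df = pd.DataFrame({"Time": [-5, 10, 30, 60, 90]})

# Step 1: every row starts with the default value.
pd_df["difficulty"] = "Unknown"

# Step 2: build one boolean mask per case; parentheses are needed around
# each comparison because & binds tighter than < and >.
easy = (pd_df["Time"] > 0) & (pd_df["Time"] < 30)
medium = (pd_df["Time"] >= 30) & (pd_df["Time"] <= 60)
hard = pd_df["Time"] > 60

# Step 3: a single .loc call selects rows and column together,
# so the assignment is written to pd_df itself.
pd_df.loc[easy, "difficulty"] = "Easy"
pd_df.loc[medium, "difficulty"] = "Medium"
pd_df.loc[hard, "difficulty"] = "Hard"

print(pd_df)
```

The row with Time equal to -5 matches none of the masks and therefore keeps the default.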

When I timed the options given on this page, the loc approach was the fastest for large frames (4-5 times faster than np.select and nested np.where); see footnote 1.

pd_df['difficulty'] = 'Unknown'
pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'

1: Code used for the benchmark.

def loc(pd_df):
    pd_df['difficulty'] = 'Unknown'
    pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
    pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
    pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'
    return pd_df

def np_select(pd_df):
    pd_df['difficulty'] = np.select([pd_df['Time'].between(0, 30, inclusive='neither'), pd_df['Time'].between(30, 60, inclusive='both'), pd_df['Time']>60], ['Easy', 'Medium', 'Hard'], 'Unknown')
    return pd_df

def nested_np_where(pd_df):
    pd_df['difficulty'] = np.where(pd_df['Time'].between(0, 30, inclusive='neither'), 'Easy', np.where(pd_df['Time'].between(30, 60, inclusive='both'), 'Medium', np.where(pd_df['Time'] > 60, 'Hard', 'Unknown')))
    return pd_df


df = pd.DataFrame({'Time': np.random.default_rng().choice(120, size=15_000_000)-30})

%timeit loc(df.copy())
# 891 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_select(df.copy())
# 3.93 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit nested_np_where(df.copy())
# 4.82 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 10 loops each)
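The %timeit lines above only work in IPython or Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard-library timeit module. The function names with_loc and with_select, the fixed seed and the much smaller frame (100,000 rows) are assumptions made so the example runs quickly; absolute timings will differ by machine and pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd


def with_loc(df):
    # Default first, then overwrite rows case by case.
    df["difficulty"] = "Unknown"
    df.loc[(df["Time"] > 0) & (df["Time"] < 30), "difficulty"] = "Easy"
    df.loc[(df["Time"] >= 30) & (df["Time"] <= 60), "difficulty"] = "Medium"
    df.loc[df["Time"] > 60, "difficulty"] = "Hard"
    return df


def with_select(df):
    # All cases in one call; unmatched rows get the default.
    df["difficulty"] = np.select(
        [
            df["Time"].between(0, 30, inclusive="neither"),
            df["Time"].between(30, 60, inclusive="both"),
            df["Time"] > 60,
        ],
        ["Easy", "Medium", "Hard"],
        default="Unknown",
    )
    return df


# Seeded, smaller stand-in for the benchmark frame: integers from -30 to 89.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Time": rng.choice(120, size=100_000) - 30})

# Sanity check before timing: both approaches should label every row alike.
a = with_loc(df.copy())["difficulty"].tolist()
b = with_select(df.copy())["difficulty"].tolist()
same = a == b
print("results identical:", same)

# Each call works on a fresh copy, so the copy cost is included in every timing.
for fn in (with_loc, with_select):
    seconds = timeit.timeit(lambda: fn(df.copy()), number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per call")
```

Checking that the outputs agree before comparing speed guards against benchmarking two functions that do not compute the same thing.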