Pandas - case when and default in pandas

Question

I have the following case statement in Python:

pd_df['difficulty'] = 'Unknown'
pd_df['difficulty'][(pd_df['Time']<30) & (pd_df['Time']>0)] = 'Easy'
pd_df['difficulty'][(pd_df['Time']>=30) & (pd_df['Time']<=60)] = 'Medium'
pd_df['difficulty'][pd_df['Time']>60] = 'Hard'

However, when I run the code, it throws an error.

A value is trying to be set on a copy of a slice from a DataFrame

Answer 1

cs9*_*s95 11

Option 1
For performance, use nested np.where conditions. For the conditions, you can simply use pd.Series.between, and the default value is filled in accordingly.

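import numpy as np

# The inner np.where handles the second range and supplies the default.
# Boolean values for `inclusive` are deprecated since pandas 1.3; use 'neither' / 'both'.
pd_df['difficulty'] = np.where(
    pd_df['Time'].between(0, 30, inclusive='neither'),
    'Easy',
    np.where(
        pd_df['Time'].between(30, 60, inclusive='both'), 'Medium', 'Unknown'
    )
)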

Option 2
Similarly, use np.select, which leaves more room for adding conditions:

pd_df['difficulty'] = np.select(
    [
        pd_df['Time'].between(0, 30, inclusive='neither'),
        pd_df['Time'].between(30, 60, inclusive='both')
    ], 
    [
        'Easy', 
        'Medium'
    ], 
    default='Unknown'
)

Option 3
Another performant solution involves loc:

pd_df['difficulty'] = 'Unknown'
pd_df.loc[pd_df['Time'].between(0, 30, inclusive='neither'), 'difficulty'] = 'Easy'
pd_df.loc[pd_df['Time'].between(30, 60, inclusive='both'), 'difficulty'] = 'Medium'
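
As a quick sanity check, the three options can be run side by side on a small frame. This is only a sketch: the sample Time values are made up for illustration, and it assumes pandas 1.3 or later for the string form of inclusive. Note that none of the three options has a 'Hard' branch, so values above 60 stay 'Unknown'.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen to hit every boundary.
pd_df = pd.DataFrame({"Time": [-5, 0, 10, 30, 45, 60, 90]})

easy = pd_df["Time"].between(0, 30, inclusive="neither")   # 0 < Time < 30
medium = pd_df["Time"].between(30, 60, inclusive="both")   # 30 <= Time <= 60

# Option 1: nested np.where
nested = np.where(easy, "Easy", np.where(medium, "Medium", "Unknown"))

# Option 2: np.select with a default
selected = np.select([easy, medium], ["Easy", "Medium"], default="Unknown")

# Option 3: start from the default and overwrite with loc
via_loc = pd.Series("Unknown", index=pd_df.index)
via_loc.loc[easy] = "Easy"
via_loc.loc[medium] = "Medium"

print(list(nested))
# ['Unknown', 'Unknown', 'Easy', 'Medium', 'Medium', 'Medium', 'Unknown']
```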

Answer 2

cot*_*ail 7

`case_when`

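Since pandas 2.2.0, you can use case_when() on a column. Just initialize the column with the default value and replace values in it using case_when(), which accepts a list of (condition, replacement) tuples. For the example in the OP, we can use the following.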

pd_df["difficulty"] = "Unknown"
pd_df["difficulty"] = pd_df["difficulty"].case_when([
    (pd_df.eval("0 < Time < 30"), "Easy"),
    (pd_df.eval("30 <= Time <= 60"), "Medium"),
    (pd_df.eval("Time > 60"), "Hard")
])
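
For a self-contained run, the same idea can be written with plain boolean masks instead of eval. This is a sketch under assumptions: the sample Time values are invented, and the hasattr guard with a loc fallback is added here only so the snippet also runs on pandas older than 2.2; it is not part of the answer.

```python
import pandas as pd

# Hypothetical sample data.
pd_df = pd.DataFrame({"Time": [-5, 10, 30, 60, 90]})
time = pd_df["Time"]

# (condition, replacement) pairs; the conditions are mutually exclusive.
caselist = [
    ((time > 0) & (time < 30), "Easy"),
    ((time >= 30) & (time <= 60), "Medium"),
    (time > 60, "Hard"),
]

# Series holding the default value for every row.
default = pd.Series("Unknown", index=pd_df.index)

if hasattr(pd.Series, "case_when"):
    # pandas >= 2.2: rows matching no condition keep the default.
    pd_df["difficulty"] = default.case_when(caselist)
else:
    # Fallback for older pandas (assumption of this sketch).
    for cond, label in caselist:
        default.loc[cond] = label
    pd_df["difficulty"] = default

print(pd_df["difficulty"].tolist())
# ['Unknown', 'Easy', 'Medium', 'Medium', 'Hard']
```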

`loc`

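The OP's code only needed loc so that __setitem__() is called correctly via []. In particular, they had already used parentheses () to evaluate each of the conditions chained with & individually.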
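
To make the difference concrete, here is a minimal sketch (the sample data is made up) of why the chained form triggers the warning while loc does not. The chained form is shown only in comments, because its behavior depends on the pandas version and copy-on-write settings.

```python
import pandas as pd

# Hypothetical sample data.
pd_df = pd.DataFrame({"Time": [-5, 10, 30, 45, 60, 90]})
pd_df["difficulty"] = "Unknown"

mask = (pd_df["Time"] > 0) & (pd_df["Time"] < 30)

# Chained form from the question:
#   pd_df["difficulty"][mask] = "Easy"
# It is evaluated in two steps:
#   tmp = pd_df.__getitem__("difficulty")   # may be a view or a copy
#   tmp.__setitem__(mask, "Easy")           # pandas cannot guarantee pd_df is updated
#
# The loc form is a single __setitem__ call on the frame itself:
pd_df.loc[mask, "difficulty"] = "Easy"

print(pd_df["difficulty"].tolist())
# ['Unknown', 'Easy', 'Unknown', 'Unknown', 'Unknown', 'Unknown']
```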

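The basic idea of this approach is to initialize the column with some default value (e.g. "Unknown") and then update rows according to the conditions (e.g. "Easy" if 0<Time<30), and so on.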

When I timed the options given on this page, the loc approach was the fastest for large frames (4-5 times faster than np.select and nested np.where).¹

pd_df['difficulty'] = 'Unknown'
pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'

¹: Code used for the benchmark.

def loc(pd_df):
    pd_df['difficulty'] = 'Unknown'
    pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
    pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
    pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'
    return pd_df

def np_select(pd_df):
    pd_df['difficulty'] = np.select([pd_df['Time'].between(0, 30, inclusive='neither'), pd_df['Time'].between(30, 60, inclusive='both'), pd_df['Time']>60], ['Easy', 'Medium', 'Hard'], 'Unknown')
    return pd_df

def nested_np_where(pd_df):
    pd_df['difficulty'] = np.where(pd_df['Time'].between(0, 30, inclusive='neither'), 'Easy', np.where(pd_df['Time'].between(30, 60, inclusive='both'), 'Medium', np.where(pd_df['Time'] > 60, 'Hard', 'Unknown')))
    return pd_df


df = pd.DataFrame({'Time': np.random.default_rng().choice(120, size=15_000_000)-30})

%timeit loc(df.copy())
# 891 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_select(df.copy())
# 3.93 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit nested_np_where(df.copy())
# 4.82 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 10 loops each)

Archived:	7 years, 7 months ago
Viewed:	6845 times
Last active:	7 years, 7 months ago