I have the following case-style statements in Python:
pd_df['difficulty'] = 'Unknown'
pd_df['difficulty'][(pd_df['Time']<30) & (pd_df['Time']>0)] = 'Easy'
pd_df['difficulty'][(pd_df['Time']>=30) & (pd_df['Time']<=60)] = 'Medium'
pd_df['difficulty'][pd_df['Time']>60] = 'Hard'
However, when I run the code, it raises the following error:
A value is trying to be set on a copy of a slice from a DataFrame
cs9*_*s95 11
Option 1
For performance, use nested np.where conditions. For the conditions you can simply use pd.Series.between, and the default value is inserted accordingly.
pd_df['difficulty'] = np.where(
    pd_df['Time'].between(0, 30, inclusive='neither'),
    'Easy',
    np.where(
        # 30 <= Time <= 60; pandas 1.3+ takes inclusive as 'both' or 'neither'
        pd_df['Time'].between(30, 60, inclusive='both'), 'Medium', 'Unknown'
    )
)
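The snippet above leaves out the question's 'Hard' branch. As a minimal sketch, assuming a small made-up frame (the sample `Time` values are not from the question), a third nested np.where level would cover it:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one value per branch, plus both Medium boundaries.
pd_df = pd.DataFrame({"Time": [-5, 10, 30, 60, 90]})
time = pd_df["Time"]

# Conditions are checked from the outermost np.where inwards; the innermost
# "else" value ('Unknown') acts as the default.
pd_df["difficulty"] = np.where(
    time.between(0, 30, inclusive="neither"), "Easy",
    np.where(
        time.between(30, 60, inclusive="both"), "Medium",
        np.where(time > 60, "Hard", "Unknown"),
    ),
)
print(pd_df)
```

Each extra category costs one more level of nesting, which is why this form gets hard to read beyond a few conditions.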
Option 2
Similarly, use np.select, which leaves more room for adding conditions:
pd_df['difficulty'] = np.select(
    [
        # pandas 1.3+ takes inclusive as 'both' or 'neither' instead of a bool
        pd_df['Time'].between(0, 30, inclusive='neither'),
        pd_df['Time'].between(30, 60, inclusive='both')
    ],
    [
        'Easy',
        'Medium'
    ],
    default='Unknown'
)
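To show the "more room for adding conditions" point, here is a sketch that appends the question's 'Hard' case as a third condition. The sample frame is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one value per branch, plus both Medium boundaries.
pd_df = pd.DataFrame({"Time": [-5, 10, 30, 60, 90]})
time = pd_df["Time"]

conditions = [
    time.between(0, 30, inclusive="neither"),  # 0 < Time < 30
    time.between(30, 60, inclusive="both"),    # 30 <= Time <= 60
    time > 60,
]
choices = ["Easy", "Medium", "Hard"]

# np.select takes the first matching condition per row; rows that match
# nothing get the default.
pd_df["difficulty"] = np.select(conditions, choices, default="Unknown")
print(pd_df)
```

Adding another category only means appending one entry to each list, with no extra nesting.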
Option 3
Another performant solution uses loc:
pd_df['difficulty'] = 'Unknown'
pd_df.loc[pd_df['Time'].between(0, 30, inclusive='neither'), 'difficulty'] = 'Easy'
pd_df.loc[pd_df['Time'].between(30, 60, inclusive='both'), 'difficulty'] = 'Medium'
case_when
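Since pandas 2.2.0, you can use case_when() on a column. Just initialize the column with a default value and replace values in it using case_when(), which accepts a list of (condition, replacement) tuples. For the example in the OP, we can use the following.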
pd_df["difficulty"] = "Unknown"
pd_df["difficulty"] = pd_df["difficulty"].case_when([
    (pd_df.eval("0 < Time < 30"), "Easy"),
    (pd_df.eval("30 <= Time <= 60"), "Medium"),
    (pd_df.eval("Time > 60"), "Hard")
])
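For anyone unsure which pandas version they are on, the following is a rough sketch of the same idea that checks for case_when() first. The sample frame, the plain comparisons used instead of eval, and the Series.mask fallback are my own assumptions for illustration, not part of the pandas recommendation:

```python
import pandas as pd

# Hypothetical sample data: one value per branch, plus both Medium boundaries.
pd_df = pd.DataFrame({"Time": [-5, 10, 30, 60, 90]})
time = pd_df["Time"]

# (condition, replacement) pairs, in priority order.
caselist = [
    ((time > 0) & (time < 30), "Easy"),
    ((time >= 30) & (time <= 60), "Medium"),
    (time > 60, "Hard"),
]

labels = pd.Series("Unknown", index=pd_df.index)  # default value
if hasattr(labels, "case_when"):
    # pandas 2.2.0 or newer
    labels = labels.case_when(caselist)
else:
    # Assumed fallback for older pandas: apply the pairs in reverse so that
    # earlier pairs are written last and therefore take priority.
    for condition, value in reversed(caselist):
        labels = labels.mask(condition, value)

pd_df["difficulty"] = labels
print(pd_df)
```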
loc
The OP's code only needed loc to call __setitem__() correctly via []. In particular, they had already used proper parentheses () to evaluate the &-chained conditions individually.
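The basic idea of this approach is to initialize the column with some default value (e.g. "Unknown") and then update rows depending on conditions (e.g. "Easy" if 0<Time<30), and so on.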
When I timed the options given on this page, the loc approach was the fastest for large frames (4-5 times faster than np.select and nested np.where).1
pd_df['difficulty'] = 'Unknown'
pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'
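To make the __setitem__() point concrete, here is a small sketch on a made-up frame (the sample values are assumptions). It contrasts the two-step chained form from the question, left as a comment, with the single-step loc form:

```python
import pandas as pd

# Hypothetical sample data: one value per branch, plus both Medium boundaries.
pd_df = pd.DataFrame({"Time": [-5, 10, 30, 60, 90]})
pd_df["difficulty"] = "Unknown"

easy = (pd_df["Time"] > 0) & (pd_df["Time"] < 30)
medium = (pd_df["Time"] >= 30) & (pd_df["Time"] <= 60)
hard = pd_df["Time"] > 60

# Chained form from the question (not executed here):
#   pd_df["difficulty"][easy] = "Easy"
# This runs pd_df.__getitem__("difficulty") first and then calls
# __setitem__ on the intermediate Series, so pandas cannot guarantee
# that the write reaches pd_df.

# Single-step form: one __setitem__ call on pd_df.loc with a
# (row mask, column label) key, so the write targets pd_df itself.
pd_df.loc[easy, "difficulty"] = "Easy"
pd_df.loc[medium, "difficulty"] = "Medium"
pd_df.loc[hard, "difficulty"] = "Hard"
print(pd_df)
```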
1: Code used for the benchmark.
import numpy as np
import pandas as pd

def loc(pd_df):
    pd_df['difficulty'] = 'Unknown'
    pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
    pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
    pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'
    return pd_df

def np_select(pd_df):
    pd_df['difficulty'] = np.select([pd_df['Time'].between(0, 30, inclusive='neither'), pd_df['Time'].between(30, 60, inclusive='both'), pd_df['Time']>60], ['Easy', 'Medium', 'Hard'], 'Unknown')
    return pd_df

def nested_np_where(pd_df):
    pd_df['difficulty'] = np.where(pd_df['Time'].between(0, 30, inclusive='neither'), 'Easy', np.where(pd_df['Time'].between(30, 60, inclusive='both'), 'Medium', np.where(pd_df['Time'] > 60, 'Hard', 'Unknown')))
    return pd_df


df = pd.DataFrame({'Time': np.random.default_rng().choice(120, size=15_000_000)-30})

# %timeit is an IPython/Jupyter magic, so run the lines below in a notebook or IPython shell
%timeit loc(df.copy())
# 891 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_select(df.copy())
# 3.93 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit nested_np_where(df.copy())
# 4.82 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 10 loops each)
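Outside IPython, a rough equivalent can be put together with the standard-library timeit module. The sketch below is only an assumption about how one might do that: the function names, the fixed seed, and the much smaller frame are mine. It also checks that two of the approaches produce the same labels before timing them. Timings will vary by machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd


def with_loc(df):
    t = df["Time"]
    df["difficulty"] = "Unknown"
    df.loc[(t > 0) & (t < 30), "difficulty"] = "Easy"
    df.loc[(t >= 30) & (t <= 60), "difficulty"] = "Medium"
    df.loc[t > 60, "difficulty"] = "Hard"
    return df


def with_np_select(df):
    t = df["Time"]
    df["difficulty"] = np.select(
        [(t > 0) & (t < 30), (t >= 30) & (t <= 60), t > 60],
        ["Easy", "Medium", "Hard"],
        default="Unknown",
    )
    return df


# Hypothetical frame, far smaller than the 15 million rows used above.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Time": rng.integers(-30, 90, size=100_000)})

# Sanity check: both approaches should label every row identically.
labels_loc = with_loc(df.copy())["difficulty"].tolist()
labels_select = with_np_select(df.copy())["difficulty"].tolist()
same = labels_loc == labels_select
print("identical labels:", same)

# Best of 3 repeats, 3 calls each; df.copy() is included in the timing,
# as in the %timeit calls above.
for func in (with_loc, with_np_select):
    best = min(timeit.repeat(lambda: func(df.copy()), number=3, repeat=3)) / 3
    print(f"{func.__name__}: {best:.4f} s per call")
```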