Numpy"where"具有多种条件

Question

Numpy"where"具有多种条件

Poi*_*son 19 python numpy dataframe pandas

我尝试在数据框"df_energy"中添加一个新列"energy_class",如果"consumption_energy"值> 400则包含字符串"high",如果"consumption_energy"值在200和400之间则包含"medium",并且低"如果"consumption_energy"值低于200.我尝试从numpy使用np.where,但我看到numpy.where(condition[, x, y])只处理两个条件而不是像我的情况一样.

有什么好主意帮我吗？

先感谢您

Answer 1

Ale*_*der 27

你可以使用三元:

np.where(consumption_energy > 400, 'high', 
         (np.where(consumption_energy < 200, 'low', 'medium')))

Run Code Online (Sandbox Code Playgroud)

我认为这很棒，除非您有复杂的多重条件，那么这可能是一个大问题。 (3认同)

Answer 2

Mer*_*lin 21

试试这个:使用@Maxu中的设置

col         = 'consumption_energy'
conditions  = [ df2[col] >= 400, (df2[col] < 400) & (df2[col]> 200), df2[col] <= 200 ]
choices     = [ "high", 'medium', 'low' ]

df2["energy_class"] = np.select(conditions, choices, default=np.nan)


  consumption_energy energy_class
0                 459         high
1                 416         high
2                 186          low
3                 250       medium
4                 411         high
5                 210       medium
6                 343       medium
7                 328       medium
8                 208       medium
9                 223       medium

Run Code Online (Sandbox Code Playgroud)

这是太棒了。需要注意的是，这与大多数“ if / elif / else”函数的工作方式相同，因为如果满足第一个和第二个条件，则第一个条件适用，而不适用第二个条件。谢谢@Merlin (2认同)

Answer 3

Max*_*axU 13

我会在这里使用cut()方法,这将生成非常有效且节省内存的categorydtype:

In [124]: df
Out[124]:
   consumption_energy
0                 459
1                 416
2                 186
3                 250
4                 411
5                 210
6                 343
7                 328
8                 208
9                 223

In [125]: pd.cut(df.consumption_energy, [0, 200, 400, np.inf], labels=['low','medium','high'])
Out[125]:
0      high
1      high
2       low
3    medium
4      high
5    medium
6    medium
7    medium
8    medium
9    medium
Name: consumption_energy, dtype: category
Categories (3, object): [low < medium < high]

Run Code Online (Sandbox Code Playgroud)

谢谢我的男人，很抱歉对这么旧的话题发表评论！ (2认同)

Answer 4

小智 8

我喜欢保持代码干净.这就是我喜欢np.vectorize这样的任务的原因.

def conditions(x):
    if x > 400:
        return "High"
    elif x > 200:
        return "Medium"
    else:
        return "Low"

func = np.vectorize(conditions)
energy_class = func(df_energy["consumption_energy"])

Run Code Online (Sandbox Code Playgroud)

然后使用以下命令将numpy数组添加为数据框中的列:

df_energy["energy_class"] = energy_class

Run Code Online (Sandbox Code Playgroud)

这种方法的优点是,如果您希望向列添加更复杂的约束,可以轻松完成.希望能帮助到你.

Answer 5

Gon*_*ica 6

让我们首先创建一个数据框，其中包含和1000000之间的随机数以用作测试01000

df_energy = pd.DataFrame({'consumption_energy': np.random.randint(0, 1000, 1000000)})

[Out]:

   consumption_energy
0                 683
1                 893
2                 545
3                  13
4                 768
5                 385
6                 644
7                 551
8                 572
9                 822

Run Code Online (Sandbox Code Playgroud)

数据框的一些描述

print(df.energy.describe())

[Out]:
       consumption_energy
count      1000000.000000
mean           499.648532
std            288.600140
min              0.000000
25%            250.000000
50%            499.000000
75%            750.000000
max            999.000000

Run Code Online (Sandbox Code Playgroud)

有多种方法可以实现这一目标，例如：

使用numpy.where

df_energy['energy_class'] = np.where(df_energy['consumption_energy'] > 400, 'high', np.where(df_energy['consumption_energy'] > 200, 'medium', 'low'))

Run Code Online (Sandbox Code Playgroud)

使用numpy.select

df_energy['energy_class'] = np.select([df_energy['consumption_energy'] > 400, df_energy['consumption_energy'] > 200], ['high', 'medium'], default='low')

Run Code Online (Sandbox Code Playgroud)

使用numpy.vectorize

df_energy['energy_class'] = np.vectorize(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))(df_energy['consumption_energy'])

Run Code Online (Sandbox Code Playgroud)

使用pandas.cut

df_energy['energy_class'] = pd.cut(df_energy['consumption_energy'], bins=[0, 200, 400, 1000], labels=['low', 'medium', 'high'])

Run Code Online (Sandbox Code Playgroud)

使用Python的内置模块

def energy_class(x):
  if x > 400:
      return 'high'
  elif x > 200:
      return 'medium'
  else:
      return 'low'

df_energy['energy_class'] = df_energy['consumption_energy'].apply(energy_class)

Run Code Online (Sandbox Code Playgroud)

使用 lambda 函数

df_energy['energy_class'] = df_energy['consumption_energy'].apply(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))

Run Code Online (Sandbox Code Playgroud)

时间比较

从我所做的所有测试来看，通过测量时间time.perf_counter()（对于测量执行时间的其他方法，请参阅此），pandas.cut是最快的方法。

                        method      time
0                   np.where()  0.124139
1                  np.select()  0.155879
2            numpy.vectorize()  0.452789
3                 pandas.cut()  0.046143
4  Python's built-in functions  0.138021
5              lambda function   0.19081

Run Code Online (Sandbox Code Playgroud)

笔记：

pandas.cut对于和之间的区别pandas.qcut，请参阅：What is the Difference Between pandas.qcut and pandas.cut?

归档时间：	9 年前
查看次数：	40631 次
最近记录：	6 年，4 月前