Pandas conditional creation of a series/dataframe column

Question

use*_*289 260 python numpy dataframe pandas

I have a dataframe along the lines of the one below:

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (an equal number of records/rows) that is set to 'green' if Set == 'Z' and to 'red' otherwise.

What is the best way to do this?

Answer 1

unu*_*tbu 591

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red

If you have more than two conditions, then use np.select. For example, if you want color to be

yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
otherwise purple when (df['Type'] == 'B')
otherwise black,

then use

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

yields

  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black
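
np.select checks the conditions in list order and, for each row, takes the choice belonging to the first condition that is True, so the order of the list matters. A minimal sketch of that behaviour on the same frame; the `reordered` variant is a hypothetical rearrangement added only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B'),
]
choices = ['yellow', 'blue', 'purple']

# Row 1 (Set='Z', Type='B') satisfies both the 2nd and the 3rd condition;
# the earlier one wins, so it becomes 'blue'.
in_order = np.select(conditions, choices, default='black')

# Hypothetical reordering: putting the broad Type == 'B' test first
# makes it shadow the more specific Z-and-B test.
reordered = np.select(
    [conditions[2], conditions[0], conditions[1]],
    ['purple', 'yellow', 'blue'],
    default='black',
)
print(in_order)
print(reordered)
```

Listing the most specific conditions first, as the answer does, keeps the broader ones from swallowing them.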

@AmolSharma: Use `&` instead of `and`. See http://stackoverflow.com/q/13589390/190597 (11 upvotes)
`df['color'] = list(np.where(df['Set']=='Z', 'green', 'red'))` will suppress the pandas warning: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead" (2 upvotes)
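
The `&` versus `and` point from the first comment can be reproduced in a few lines. This is a small sketch on the frame from the answer; the variable name `mask` is chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

# `and` asks each Series for a single truth value, which pandas refuses.
try:
    mask = (df['Set'] == 'Z') and (df['Type'] == 'B')
except ValueError as exc:
    print('and failed:', exc)

# `&` combines the two boolean Series element-wise. The parentheses are
# needed because & binds more tightly than ==.
mask = (df['Set'] == 'Z') & (df['Type'] == 'B')
print(mask.tolist())
```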

Answer 2

che*_*ard 100

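A list comprehension is another way to create a column conditionally. If you are working with object dtypes in columns, as in your example, list comprehensions typically outperform most other methods.

Example list comprehension: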

df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]

%timeit tests:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop

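@cheekybastard Or don't, since `.iterrows()` is notoriously slow and the DataFrame should not be modified while iterating. (4 upvotes)
Note that, with larger dataframes (think `pd.DataFrame({'Type':list('ABBC')*100000, 'Set':list('ZZXY')*100000})` size), `numpy.where` outpaces `map`, but the list comprehension is king (about 50% faster than `numpy.where`). (3 upvotes)
Can the list comprehension method be used if the condition needs information from multiple columns? I am looking for something like this (this does not work): `df['color'] = ['red' if (x['Set'] == 'Z') & (x['Type'] == 'B') else 'green' for x in df]` (2 upvotes)
Add iterrows to the dataframe, then you can access multiple columns via row: `['red' if (row['Set'] == 'Z') & (row['Type'] == 'B') else 'green' for index, row in df.iterrows()]` (2 upvotes)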
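
For the multi-column question raised in the comments, one alternative to `.iterrows()` is to zip the columns together. This is only a sketch, not something proposed in the thread; it uses the same 'red'/'green' rule as the comment, and no timings are claimed for it:

```python
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

# zip walks the two columns in parallel, so each iteration sees plain
# Python strings rather than a row Series.
df['color'] = [
    'red' if (s == 'Z' and t == 'B') else 'green'
    for s, t in zip(df['Set'], df['Type'])
]
print(df)
```

Because `s` and `t` are scalars here, the ordinary `and` is fine inside the comprehension.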

Answer 3

bla*_*ite 18

Here is yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:

def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))

Here is what it looks like:

df
Out[2]: 
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4

This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).

And of course you could always do this:

df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

But that approach is more than three times as slow as the apply approach from above, on my machine.

And you could also do this, using dict.get:

df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]

Update: On 100,000,000 rows with 52 string values, `.apply()` takes 47 seconds, versus only 5.91 seconds for `.map()`. (3 upvotes)
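
The three dictionary variants above differ in how they treat a value that has no entry in the dictionary. A short sketch of that difference, assuming a hypothetical extra value 'E' and an arbitrary fallback of -1:

```python
import pandas as pd

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
# 'E' is a hypothetical extra value with no entry in values_dict.
s = pd.Series(['A', 'B', 'E'], name='INDICATOR')

# map leaves unmatched keys as NaN, which also turns the result into floats.
mapped = s.map(values_dict)

# dict.get lets the caller choose the fallback; -1 is an arbitrary choice here.
with_default = [values_dict.get(v, -1) for v in s]

# Direct indexing, as map_values does, raises on an unknown key.
try:
    s.apply(lambda v: values_dict[v])
except KeyError as exc:
    print('apply raised KeyError for', exc)

print(mapped.tolist())
print(with_default)
```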

Answer 4

ach*_*uva 16

Another way in which this could be achieved is

df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

Answer 5

bli*_*bli 15

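The following is slower than the approaches timed here, but it lets us compute the extra column from the contents of more than one column, and the extra column can take more than two values.

Simple example using just the "Set" column: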

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Example with more colours and more columns taken into account:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue
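
Since this answer notes that the row-wise apply is slower, the same if/elif chain can also be written with np.select from the first answer. The following sketch assumes the question's frame and checks that both routes give the same colours; the names `vectorized` and `row_wise` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

# Conditions are listed in the same order as the if/elif chain,
# because np.select takes the first one that matches.
vectorized = np.select(
    [df["Set"] == "Z", df["Type"] == "C"],
    ["red", "blue"],
    default="green",
)

row_wise = df.apply(set_color, axis=1)
print((row_wise == vectorized).all())
```

The function form remains the more flexible one when the logic does not reduce to a fixed list of boolean masks.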

Answer 6

Hos*_*ein 6

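Maybe this has been made possible by newer versions of Pandas, but I think the following is the shortest and perhaps best answer to the question so far. You can use one condition or several, depending on your need.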

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"
print(df)

# result: 
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red
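
The "several conditions" case mentioned in this answer might look like the sketch below. The extra 'blue' rule is a made-up example and is not part of the question:

```python
import pandas as pd

df = pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

df['Color'] = "red"                          # default for every row
df.loc[df['Set'] == "Z", 'Color'] = "green"
# A later assignment overrides earlier ones for the rows it selects.
# The 'blue' rule is hypothetical, added only to show a combined condition.
df.loc[(df['Set'] == "Z") & (df['Type'] == "B"), 'Color'] = "blue"
print(df)
```

Unlike np.select, where the first matching condition wins, here the last matching assignment wins, so the most specific rule goes last.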

Answer 7

Myk*_*tko 5

You can use the pandas methods where and mask:

df['color'] = 'green'
df['color'] = df['color'].where(df['Set']=='Z', other='red')
# Replace values where the condition is False

or

df['color'] = 'red'
df['color'] = df['color'].mask(df['Set']=='Z', other='green')
# Replace values where the condition is True

Alternatively, you can use the method transform with a lambda function:

df['color'] = df['Set'].transform(lambda x: 'green' if x == 'Z' else 'red')

Output:

  Type Set  color
1    A   Z  green
2    B   Z  green
3    B   X    red
4    C   Y    red

Performance comparison from @chai:

import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})

%timeit df['color1'] = 'red'; df['color1'].where(df['Set']=='Z','green')
%timeit df['color2'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color3'] = np.where(df['Set']=='Z', 'red', 'green')
%timeit df['color4'] = df.Set.map(lambda x: 'red' if x == 'Z' else 'green')

397 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
976 ms ± 241 ms per loop
673 ms ± 139 ms per loop
796 ms ± 182 ms per loop
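
`%timeit` is an IPython magic and is not available in a plain Python script. A rough equivalent using the standard `timeit` module is sketched below; the frame is deliberately smaller than the one in the comparison, the `candidates` dictionary is an illustrative name, and the numbers it prints will vary by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# A smaller frame than in the comparison, so the sketch runs quickly.
df = pd.DataFrame({'Type': list('ABBC') * 10_000, 'Set': list('ZZXY') * 10_000})

candidates = {
    'list comprehension': lambda: ['red' if x == 'Z' else 'green' for x in df['Set']],
    'np.where': lambda: np.where(df['Set'] == 'Z', 'red', 'green'),
    'Series.map': lambda: df['Set'].map(lambda x: 'red' if x == 'Z' else 'green'),
}

for name, func in candidates.items():
    # Best of 3 repeats, 5 calls each; reported per call in milliseconds.
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f'{name}: {best * 1e3:.2f} ms')
```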

Archived: 11 years, 10 months ago
Views: 330361
Last activity: 5 years, 11 months ago