Is there a better, more readable way to coalesce columns in pandas?

use*_*423 10 python pandas

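I often need a new column that holds the best value I can get from several other columns, and I have a specific priority order for those columns. I want to take the first non-null value.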

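import pandas as pd

def coalesce(values):
    # Return the first element that is not None, or None if every element is None.
    not_none = (el for el in values if el is not None)
    return next(not_none, None)

# columns= fixes the column order; recent pandas versions otherwise keep the dict key order.
df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}],
                  columns=['first', 'second', 'third'])

df['combo1'] = df.apply(coalesce, axis=1)
df['combo2'] = df[['second','third','first']].apply(coalesce, axis=1)
print(df)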

Result:

  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B

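This code works (the result is what I want), but it is not very fast.
I can choose my priority order by selecting the columns in that order, e.g. [['second','third','first']].

coalesce works somewhat like the T-SQL function of the same name.
I suspect I may have overlooked a simple way to do this that performs well on large DataFrames (400,000+ rows).

I know there are many ways to fill in missing data, which I often use along axis=0. That is what makes me think I may be missing a simple option for axis=1.

Can you suggest something better or faster, or confirm that this is as good as it gets?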

Bra*_*des 29

The pandas equivalent of COALESCE is the fillna() method:

result = column_a.fillna(column_b)

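The result is a column in which each value is taken from column_a if that column provides a non-null value, and otherwise from column_b. So your combo1 can be produced with: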

df['first'].fillna(df['second']).fillna(df['third'])

which gives:

0       A
1       C
2       B
3    None
4       A
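As a minimal sketch of the fillna() semantics described above, the following uses two small hypothetical Series (the data is invented for illustration; only the names column_a and column_b come from the answer). Values are matched by index label, and a cell that is null in both inputs stays null.

```python
import pandas as pd

# Hypothetical example data; not taken from the question's DataFrame.
column_a = pd.Series(["A", None, None], index=[0, 1, 2])
column_b = pd.Series(["x", "y", None], index=[0, 1, 2])

# Where column_a is null, take the value with the same index label from column_b.
result = column_a.fillna(column_b)

print(result.tolist())
```

Row 0 keeps the value from column_a, row 1 falls back to column_b, and row 2 remains null because neither column has a value there.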

combo2 can be produced with:

(df['second']).fillna(df['third']).fillna(df['first'])

which returns the new column:

0       C
1       C
2       B
3    None
4       B

If you want an efficient operation called coalesce, it can simply combine the columns with fillna() from left to right and then return the result:

def coalesce(df, column_names):
    # Start from the highest-priority column ...
    i = iter(column_names)
    column_name = next(i)
    answer = df[column_name]
    # ... and fill its remaining nulls from each later column in turn.
    for column_name in i:
        answer = answer.fillna(df[column_name])
    return answer

print(coalesce(df, ['first', 'second', 'third']))
print(coalesce(df, ['second', 'third', 'first']))

This gives:

0       A
1       C
2       B
3    None
4       A

0       C
1       C
2       B
3    None
4       B
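A different idiom, not part of the answer above, is to select the columns in priority order and back-fill along axis=1, so that the first column ends up holding the first non-null value of each row. The sketch below assumes the question's data, rebuilt with an explicit column order; the helper name coalesce_bfill is hypothetical.

```python
import pandas as pd

# The question's data, rebuilt column by column so the column order is explicit.
df = pd.DataFrame(
    {
        "first": ["A", None, None, None, "A"],
        "second": ["C", "C", None, None, None],
        "third": ["B", "B", "B", None, "B"],
    }
)

def coalesce_bfill(frame, column_names):
    """Return the first non-null value per row, searching columns in the given order."""
    ordered = frame[list(column_names)]
    # bfill(axis=1) pulls each row's next non-null value leftwards,
    # so the first column ends up holding the coalesced result.
    return ordered.bfill(axis=1).iloc[:, 0]

combo1 = coalesce_bfill(df, ["first", "second", "third"])
combo2 = coalesce_bfill(df, ["second", "third", "first"])
```

Whether this is faster than chained fillna() calls depends on the data and the pandas version, so it is worth timing both on a realistic DataFrame.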


unu*_*tbu 0

You can use pd.isnull to find the null values (None, in this case):

In [169]: pd.isnull(df)
Out[169]:
   first second  third
0  False  False  False
1   True  False  False
2   True   True  False
3   True   True   True
4  False   True  False

Then use np.argmin to find the index of the first non-null value in each row. If all the values in a row are null, np.argmin returns 0:

In [186]: np.argmin(pd.isnull(df).values, axis=1)
Out[186]: array([0, 1, 2, 0, 0])
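To see why argmin finds the first non-null column, note that False sorts before True, and np.argmin returns the position of the first minimum in each row. The sketch below uses a small hypothetical mask; the all_null array is an extra illustration for distinguishing an all-null row from a row whose first column is valid.

```python
import numpy as np

# Hypothetical null mask: True marks a null cell, one row per record.
mask = np.array([
    [False, False, False],  # no nulls: first non-null is column 0
    [True, False, False],   # first non-null is column 1
    [True, True, True],     # all null: argmin still returns 0
])

first_valid = np.argmin(mask, axis=1)
all_null = mask.all(axis=1)

print(first_valid.tolist())  # [0, 1, 0]
print(all_null.tolist())     # [False, False, True]
```

For an all-null row, index 0 points at a null cell, so selecting that cell still yields a null result.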

You can then select the desired values from df using NumPy integer indexing:

In [193]: df.values[np.arange(len(df)), np.argmin(pd.isnull(df).values, axis=1)]
Out[193]: array(['A', 'C', 'B', None, 'A'], dtype=object)

For example,

import numpy as np
import pandas as pd

# columns= fixes the column order; recent pandas versions otherwise keep the dict key order.
df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}],
                  columns=['first', 'second', 'third'])

mask = pd.isnull(df).values
df['combo1'] = df.values[np.arange(len(df)), np.argmin(mask, axis=1)]
# Search in the order second, third, first, then map back to the original column positions.
order = np.array([1,2,0])
mask = mask[:, order]
df['combo2'] = df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]

yields

  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B
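The argmin steps can be wrapped into a reusable function that takes the columns in priority order, which avoids the manual order array. This is a sketch under stated assumptions rather than code from the answer: the helper name coalesce_argmin is hypothetical, and the values are converted to an object array so the approach works for mixed column types.

```python
import numpy as np
import pandas as pd

def coalesce_argmin(frame, column_names):
    """Return the first non-null value per row, searching columns in the given order."""
    # Selecting the columns in priority order means position 0 is the top choice.
    values = frame[list(column_names)].to_numpy(dtype=object)
    mask = pd.isnull(values)
    # False sorts before True, so argmin gives the first non-null position per row.
    first_valid = np.argmin(mask, axis=1)
    picked = values[np.arange(len(values)), first_valid]
    return pd.Series(picked, index=frame.index)

# The question's data with an explicit column order.
df = pd.DataFrame(
    {
        "first": ["A", None, None, None, "A"],
        "second": ["C", "C", None, None, None],
        "third": ["B", "B", "B", None, "B"],
    }
)

combo1 = coalesce_argmin(df, ["first", "second", "third"])
combo2 = coalesce_argmin(df, ["second", "third", "first"])
```

Because only the named columns are selected, earlier result columns such as combo1 cannot leak into later calls.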

If the DataFrame has many rows, using argmin is noticeably faster than a row-wise apply(coalesce, ...):

df2 = pd.concat([df]*1000)

In [230]: %timeit mask = pd.isnull(df2).values; df2.values[np.arange(len(df2)), np.argmin(mask, axis=1)]
1000 loops, best of 3: 617 µs per loop

In [231]: %timeit df2.apply(coalesce, axis=1)
10 loops, best of 3: 84.1 ms per loop
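The %timeit lines above need IPython. As a rough sketch, a similar comparison can be run as a plain script with the standard timeit module. The names coalesce_argmin, small, and big are hypothetical, the data is the question's example built with dtype=object so that None is preserved, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def coalesce(values):
    """Row-wise version from the question: first element that is not None."""
    return next((el for el in values if el is not None), None)

def coalesce_argmin(frame):
    """Vectorised version: pick the first non-null cell in each row."""
    values = frame.to_numpy(dtype=object)
    mask = pd.isnull(values)
    return values[np.arange(len(values)), np.argmin(mask, axis=1)]

small = pd.DataFrame(
    {
        "first": ["A", None, None, None, "A"],
        "second": ["C", "C", None, None, None],
        "third": ["B", "B", "B", None, "B"],
    },
    dtype=object,  # keep None as None rather than converting it
)
big = pd.concat([small] * 1000, ignore_index=True)

# Average wall-clock time over a few runs of each approach.
t_apply = timeit.timeit(lambda: big.apply(coalesce, axis=1), number=3) / 3
t_argmin = timeit.timeit(lambda: coalesce_argmin(big), number=3) / 3

print(f"apply:  {t_apply * 1e3:.2f} ms per call")
print(f"argmin: {t_argmin * 1e3:.2f} ms per call")
```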