How to fill a column based on several other columns?

Cle*_*leb 4 python dataframe pandas

I have two dataframes like this:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        'A': list('aaabdcde'),
        'B': list('smnipiuy'),
        'C': list('zzzqqwll')
    }
)

df2 = pd.DataFrame(
    {
        'mapcol': list('abpppozl')
    }
)

   A  B  C
0  a  s  z
1  a  m  z
2  a  n  z
3  b  i  q
4  d  p  q
5  c  i  w
6  d  u  l
7  e  y  l

  mapcol
0      a
1      b
2      p
3      p
4      p
5      o
6      z
7      l

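Now I would like to create an additional column in df1 that is filled with values coming from the columns A, B and C, depending on whether their values can be found in df2['mapcol']. If the values in one row can be found in multiple columns, the value from A should be used first, then B, and then C, so my expected outcome looks like this: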

   A  B  C final
0  a  s  z     a  # <- values can be found in A and C, but A is preferred
1  a  m  z     a  # <- values can be found in A and C, but A is preferred
2  a  n  z     a  # <- values can be found in A and C, but A is preferred
3  b  i  q     b  # <- value can be found in A 
4  d  p  q     p  # <- value can be found in B
5  c  i  w   NaN  # none of the values can be mapped
6  d  u  l     l  # value can be found in C
7  e  y  l     l  # value can be found in C

A straightforward implementation could look like this (filling the column final iteratively with fillna, in the preferred order):

preferred_order = ['A', 'B', 'C']

df1['final'] = np.nan

for col in preferred_order:
    df1['final'] = df1['final'].fillna(df1[col][df1[col].isin(df2['mapcol'])])

This gives the expected outcome.

Does anyone see a solution that avoids the loop?

Shu*_*rma 5

Use:

order =  ['A', 'B', 'C'] # order of columns

d = df1[order].isin(df2['mapcol'].tolist()).loc[lambda x: x.any(axis=1)].idxmax(axis=1)
df1.loc[d.index, 'final'] = df1.lookup(d.index, d)

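Details:

Use DataFrame.isin together with DataFrame.any along axis=1 as a boolean mask to keep only the rows that contain at least one match. Then use DataFrame.idxmax along axis=1 to get, for each remaining row, the name of the column holding the first maximum value, that is, the first True in the preferred order.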

print(d)
0    A
1    A
2    A
3    A
4    B
6    C
7    C
dtype: object
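
The one-liner that builds d can also be split into its three steps to make the intermediate results visible. This is only an illustrative sketch: the names mask and matched are introduced here and are not part of the answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': list('aaabdcde'),
    'B': list('smnipiuy'),
    'C': list('zzzqqwll'),
})
df2 = pd.DataFrame({'mapcol': list('abpppozl')})

order = ['A', 'B', 'C']  # preferred order of columns

# Step 1: True wherever a cell's value occurs in df2['mapcol'].
mask = df1[order].isin(df2['mapcol'].tolist())

# Step 2: keep only the rows with at least one match (row 5 drops out).
matched = mask.loc[mask.any(axis=1)]

# Step 3: idxmax returns the label of the first True in each row,
# because True is the row maximum and ties resolve to the first column.
d = matched.idxmax(axis=1)

print(mask)
print(d)
```

The filter in step 2 matters: for a row that contains only False values, idxmax would still return the first column label (A), so unmatched rows have to be removed before calling it.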

Use DataFrame.lookup to look up the values in df1 that correspond to the index and the column names in d, and assign these values to the column final.

print(df1)
   A  B  C final
0  a  s  z     a
1  a  m  z     a
2  a  n  z     a
3  b  i  q     b
4  d  p  q     p
5  c  i  w   NaN
6  d  u  l     l
7  e  y  l     l
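
Note that DataFrame.lookup was deprecated in pandas 1.2.0 and removed in pandas 2.0.0, so the second line of this answer fails on current versions. A possible replacement, sketched here as an assumption rather than as part of the original answer, converts the labels in d to integer positions and picks the cells with NumPy indexing. The names row_pos and col_pos are introduced for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': list('aaabdcde'),
    'B': list('smnipiuy'),
    'C': list('zzzqqwll'),
})
df2 = pd.DataFrame({'mapcol': list('abpppozl')})

order = ['A', 'B', 'C']  # preferred order of columns

# Same first step as in the answer: first matching column per matched row.
d = (df1[order]
     .isin(df2['mapcol'].tolist())
     .loc[lambda x: x.any(axis=1)]
     .idxmax(axis=1))

# Translate the row labels and column labels in d into integer positions.
row_pos = df1.index.get_indexer(d.index)
col_pos = df1.columns.get_indexer(d)

# Pick one cell per (row, column) pair; this stands in for df1.lookup(d.index, d).
df1.loc[d.index, 'final'] = df1.to_numpy()[row_pos, col_pos]

print(df1)
```

Rows that are absent from d (here row 5) are not assigned and therefore stay NaN in final.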


Ben*_*n.T 5

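You can use where and isin on the full dataframe to mask the values of df1 that are not in df2, then reorder the columns with preferred_order, use bfill along the columns, and keep the first column with iloc.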

preferred_order = ['A', 'B', 'C']

df1['final'] = (df1.where(df1.isin(df2['mapcol'].to_numpy()))
                   [preferred_order]
                   .bfill(axis=1)
                   .iloc[:, 0]
               )
print (df1)
   A  B  C final
0  a  s  z     a
1  a  m  z     a
2  a  n  z     a
3  b  i  q     b
4  d  p  q     p
5  c  i  w   NaN
6  d  u  l     l
7  e  y  l     l
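
To see why the chained expression works, it can be unrolled into separate steps. The following is a minimal sketch of the same technique: the intermediate names masked, ordered and filled are introduced here for illustration and do not appear in the answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': list('aaabdcde'),
    'B': list('smnipiuy'),
    'C': list('zzzqqwll'),
})
df2 = pd.DataFrame({'mapcol': list('abpppozl')})

preferred_order = ['A', 'B', 'C']

# Step 1: keep cells whose value occurs in df2['mapcol'], set the rest to NaN.
masked = df1.where(df1.isin(df2['mapcol'].to_numpy()))

# Step 2: arrange the columns by priority, highest first.
ordered = masked[preferred_order]

# Step 3: back-fill along each row, so the first column ends up holding
# the first non-missing value in priority order (or NaN if there is none).
filled = ordered.bfill(axis=1)

# Step 4: the first column is the result.
final = filled.iloc[:, 0]

print(masked)
print(final)
```

If df1 has many columns that are not part of preferred_order, selecting df1[preferred_order] before calling where should give the same result while masking fewer cells.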