I have a very large (15 million rows) pandas dataframe df;
a sample is given below:
import pandas as pd
df = pd.DataFrame({'a':['ar', 're' ,'rw', 'rew', 'are'], 'b':['gh', 're', 'ww', 'rew', 'all'], 'c':['ar', 're', 'ww', '', 'different']})
df
a b c
0 ar gh ar
1 re re re
2 rw ww ww
3 rew rew
4 are all different
I want to add another column d
containing the most common value from the other columns (three here, but there could be 4 or 5 in the actual dataframe), i.e. a, b, and c
in this case. The output df
should look as follows:
a b c d
0 ar gh ar ar
1 re re re re
2 rw ww ww ww
3 rew rew rew
4 are all different
What is the most efficient way to achieve this without using a lambda
function, which can be pretty slow (45 minutes to an hour) given that df
has 15 million rows?
IIUC, you need:
m = df.mode(axis=1).iloc[:, 0]
df['d'] = m.mask(df.nunique(axis=1).eq(df.shape[1]))  # NaN where all columns differ
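To see how the pieces fit together, here is a runnable sketch on the sample frame (the intermediate names `m` and `all_distinct` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': ['ar', 're', 'rw', 'rew', 'are'],
                   'b': ['gh', 're', 'ww', 'rew', 'all'],
                   'c': ['ar', 're', 'ww', '', 'different']})

# Row-wise mode: the first column of the result holds the most
# frequent value in each row (ties are broken alphabetically).
m = df.mode(axis=1).iloc[:, 0]

# A row where every column differs has as many unique values as columns.
all_distinct = df.nunique(axis=1).eq(df.shape[1])

# Keep the mode, but blank it out (NaN) where all values differ.
df['d'] = m.mask(all_distinct)
print(df)
```

Note that `nunique` is computed before `d` is assigned, so `df.shape[1]` is still the original column count; re-running the last line on the augmented frame would change the condition.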
For a faster alternative:
import numpy as np

df['d'] = np.where(df.nunique(axis=1).eq(df.shape[1]),
                   np.nan,
                   df.mode(axis=1).iloc[:, 0])
a b c d
0 ar gh ar ar
1 re re re re
2 rw ww ww ww
3 rew rew rew
4 are all different NaN
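As a self-contained version of the `np.where` alternative, this sketch builds the same sample frame and produces the output above; `np.where` selects element-wise between `np.nan` and the row mode, skipping the index-alignment work that `Series.mask` does:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['ar', 're', 'rw', 'rew', 'are'],
                   'b': ['gh', 're', 'ww', 'rew', 'all'],
                   'c': ['ar', 're', 'ww', '', 'different']})

# Where every value in the row is distinct, emit NaN; otherwise the row mode.
df['d'] = np.where(df.nunique(axis=1).eq(df.shape[1]),
                   np.nan,
                   df.mode(axis=1).iloc[:, 0])
print(df)
```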