I have a very large (15 million rows) pandas dataframe df;
a sample is given below:
import pandas as pd
df = pd.DataFrame({'a':['ar', 're' ,'rw', 'rew', 'are'], 'b':['gh', 're', 'ww', 'rew', 'all'], 'c':['ar', 're', 'ww', '', 'different']})
df
a b c
0 ar gh ar
1 re re re
2 rw ww ww
3 rew rew
4 are all different
I want to add another column d
containing the most common value from the other columns (three here, but there could be 4 or 5 in the actual dataframe), i.e. a, b, and c
in this case. The output df
should look as follows:
a b c d
0 ar gh ar ar
1 re re re re
2 rw ww ww ww
3 rew rew rew
4 are all different
What is the most efficient way to achieve this without using a lambda
function, which can be pretty slow (45 minutes to an hour) given that df
has 15 million rows?
IIUC, you need:
m = df.mode(axis=1).iloc[:, 0]
df['d'] = m.mask(df.nunique(axis=1).eq(df.shape[1]))  # NaN where all columns differ
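To see how the pieces fit together, here is a runnable sketch on the sample frame (the intermediate names `m` and `all_distinct` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': ['ar', 're', 'rw', 'rew', 'are'],
                   'b': ['gh', 're', 'ww', 'rew', 'all'],
                   'c': ['ar', 're', 'ww', '', 'different']})

# Row-wise mode: the first column of the result holds the most
# frequent value in each row (ties are broken alphabetically).
m = df.mode(axis=1).iloc[:, 0]

# A row where every column differs has as many unique values as columns.
all_distinct = df.nunique(axis=1).eq(df.shape[1])

# Keep the mode, but blank it out (NaN) where all values differ.
df['d'] = m.mask(all_distinct)
print(df)
```

Note that `nunique` is computed before `d` is assigned, so `df.shape[1]` is still the original column count; re-running the last line on the augmented frame would change the condition.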
For a faster alternative:
import numpy as np

df['d'] = np.where(df.nunique(axis=1).eq(df.shape[1]),
                   np.nan,
                   df.mode(axis=1).iloc[:, 0])
a b c d
0 ar gh ar ar
1 re re re re
2 rw ww ww ww
3 rew rew rew
4 are all different NaN
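As a self-contained version of the `np.where` alternative, this sketch builds the same sample frame and produces the output above; `np.where` selects element-wise between `np.nan` and the row mode, skipping the index-alignment work that `Series.mask` does:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['ar', 're', 'rw', 'rew', 'are'],
                   'b': ['gh', 're', 'ww', 'rew', 'all'],
                   'c': ['ar', 're', 'ww', '', 'different']})

# Where every value in the row is distinct, emit NaN; otherwise the row mode.
df['d'] = np.where(df.nunique(axis=1).eq(df.shape[1]),
                   np.nan,
                   df.mode(axis=1).iloc[:, 0])
print(df)
```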