Combine pandas string columns with missing values

CoM*_*tel 5 python pandas

I need to concatenate the strings in two or more columns of a pandas DataFrame.

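I found this answer, which works fine as long as there are no missing values. Unfortunately I do have some, and that leads to results like "ValueA; None", which is not very clean.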

Sample data:

col_A  | col_B
------ | ------
val_A  | val_B 
None   | val_B 
val_A  | None 
None   | None

I need this result:

col_merge
---------
val_A;val_B
val_B
val_A
None
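For anyone who wants to reproduce this, the sample above could be built roughly as follows. This is a sketch that assumes the missing values are plain Python `None` in object columns; the name `expected` is only there for illustration.

```python
import pandas as pd

# Sample frame from the tables above; None marks a missing value.
# dtype=object keeps the None entries as-is (an assumption about how
# the real data is stored).
df = pd.DataFrame(
    {
        "col_A": ["val_A", None, "val_A", None],
        "col_B": ["val_B", "val_B", None, None],
    },
    dtype=object,
)

# Desired result, one entry per row.
expected = ["val_A;val_B", "val_B", "val_A", None]

print(df)
```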

jez*_*ael 8

You can use apply with if-else:

# store the result in s so that df stays a DataFrame for the code further down
s = df.apply(lambda x: None if x.isnull().all() else ';'.join(x.dropna()), axis=1)
print(s)
0    val_A;val_B
1          val_B
2          val_A
3           None
dtype: object
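Since the question mentions "two or more columns", the same row-wise idea can be wrapped in a small helper that takes the columns to merge and leaves the original frame untouched. This is only a sketch: `merge_columns` and `join_row` are hypothetical names, and the sample frame is rebuilt here so the block runs on its own.

```python
import pandas as pd


def merge_columns(df, cols, sep=";"):
    """Join the non-null values of `cols` row by row; all-null rows give None."""

    def join_row(row):
        parts = row.dropna()  # drops both None and NaN
        return sep.join(parts.astype(str)) if len(parts) else None

    return df[cols].apply(join_row, axis=1)


# Sample data from the question (None marks a missing value).
df = pd.DataFrame(
    {
        "col_A": ["val_A", None, "val_A", None],
        "col_B": ["val_B", "val_B", None, None],
    },
    dtype=object,
)

# Write the merged values to a new column instead of overwriting df.
df["col_merge"] = merge_columns(df, ["col_A", "col_B"])
print(df)
```

Like the one-liner above, this still calls a Python function once per row, so it is convenient rather than fast.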

For a faster solution, you can use:

import numpy as np
import pandas as pd

# append the separator to every value, replace NaN with an empty string,
# and convert the rows to lists
arr = df.add(';').fillna('').values.tolist()
# join each row, strip the leftover separator, and turn empty strings into NaN
s = pd.Series([''.join(x).strip(';') for x in arr]).replace('^$', np.nan, regex=True)
# replace NaN with None
s = s.where(s.notnull(), None)
print (s)
0    val_A;val_B
1          val_B
2          val_A
3           None
dtype: object

Timings:

# 40,000 rows: the 4-row sample repeated 10,000 times
df = pd.concat([df]*10000).reset_index(drop=True)

In [70]: %%timeit
    ...: arr = df.add('; ').fillna('').values.tolist()
    ...: s = pd.Series([''.join(x).strip('; ') for x in arr]).replace('^$', np.nan, regex=True)
    ...: s.where(s.notnull(), None)
    ...: 
10 loops, best of 3: 74 ms per loop


In [71]: %%timeit
    ...: df.apply(lambda x: None if x.isnull().all() else ';'.join(x.dropna()), axis=1)
    ...: 
1 loop, best of 3: 12.7 s per loop

# another solution, but slightly slower
In [72]: %%timeit
     ...: arr = df.add('; ').fillna('').values  
     ...: s = [''.join(x).strip('; ') for x in arr]
     ...: pd.Series([y if y != '' else None for y in s])
     ...: 
     ...: 
10 loops, best of 3: 119 ms per loop
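One caveat with the strip-based versions: `str.strip` removes every leading and trailing separator character, including ones that belong to the data, so a value such as "a;" would come out as "a". A possible variant that filters out the missing values before joining avoids this. It is a sketch only: `join_non_null` is a hypothetical name, the "a;" row is made up to show the edge case, and it has not been benchmarked against the timings above.

```python
import pandas as pd


def join_non_null(df, sep=";"):
    """Row-wise join that skips missing values without relying on str.strip."""
    merged = []
    for row in df.to_numpy(dtype=object).tolist():
        # keep only the values that are neither None nor NaN
        parts = [str(v) for v in row if pd.notna(v)]
        merged.append(sep.join(parts) if parts else None)
    return pd.Series(merged, index=df.index, dtype=object)


# Same layout as the question's sample, with one value ending in the separator.
df = pd.DataFrame(
    {
        "col_A": ["val_A", None, "a;", None],
        "col_B": ["val_B", "val_B", None, None],
    },
    dtype=object,
)

result = join_non_null(df)
print(result.tolist())  # ['val_A;val_B', 'val_B', 'a;', None]
```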