Suppose I have the following Pandas DataFrame:
                a               b
0             NAN  BABA UN EQUITY
1             NAN            2018
2             NAN            2017
3             NAN            2016
4             NAN             NAN
5             NAN   700 HK EQUITY
6             NAN            2018
7             NAN            2017
8             NAN            2016
9             NAN             NAN
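I want to check every cell in column b to see whether it contains the string EQUITY. If it does, I want to fill column a in the following rows with that string, stopping at the next row that is entirely NAN, so that the edited DataFrame looks like this: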
                a               b
0             NAN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NAN             NAN
5             NAN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NAN             NAN
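My actual DataFrame is much larger than the one above, but it has the same format. I am very new to Pandas, but I think I can work out the text-replacement part by looping and assigning cell values with sheet.loc.
However, I can't figure out how to check whether a cell contains EQUITY. It seems that str.contains is what I should use, but it's not clear to me how to do that.
Thanks!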
unu*_*tbu 15
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN'],
'b': ['BABA UN EQUITY', '2018', '2017', '2016', 'NAN', '700 HK EQUITY', '2018', '2017', '2016', 'NAN']})
# Make sure that all NaN values are `np.nan` not `'NAN'` (strings)
df = df.replace('NAN', np.nan)
mask = df['b'].str.contains(r'EQUITY', na=True)
df.loc[mask, 'a'] = df['b']
df['a'] = df['a'].ffill()
df.loc[mask, 'a'] = np.nan
yields
                a               b
0             NaN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NaN             NaN
5             NaN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NaN             NaN
The slightly tricky part above is how mask is defined. Note that the Series returned by str.contains holds not only True and False values but also NaN:
In [114]: df['b'].str.contains(r'EQUITY')
Out[114]:
0 True
1 False
2 False
3 False
4 NaN
5 True
6 False
7 False
8 False
9 NaN
Name: b, dtype: object
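str.contains(..., na=True) is used so that the NaNs are treated as True. That way the blank separator rows are part of mask and get reset to NaN in the final step: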
In [116]: df['b'].str.contains(r'EQUITY', na=True)
Out[116]:
0 True
1 False
2 False
3 False
4 True
5 True
6 False
7 False
8 False
9 True
Name: b, dtype: bool
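To see why the NaN rows need to be True in the mask, here is a minimal sketch that runs the same three steps twice, once with na=True and once with na=False. The helper name fill_labels is hypothetical (it is not part of pandas), and column a is created with object dtype as an assumption so that strings can be assigned into it.

```python
import numpy as np
import pandas as pd

b = pd.Series(['BABA UN EQUITY', '2018', '2017', '2016', np.nan,
               '700 HK EQUITY', '2018', '2017', '2016', np.nan])

def fill_labels(b, na):
    # Hypothetical helper: applies the three steps from this answer
    # with a configurable `na` argument for str.contains.
    df = pd.DataFrame({'a': pd.Series(np.nan, index=b.index, dtype=object),
                       'b': b})
    mask = df['b'].str.contains('EQUITY', na=na)
    df.loc[mask, 'a'] = df['b']   # copy b into a on the masked rows
    df['a'] = df['a'].ffill()     # carry each label downward
    df.loc[mask, 'a'] = np.nan    # blank out the masked rows again
    return df['a']

with_true = fill_labels(b, na=True)
with_false = fill_labels(b, na=False)

# Side-by-side view of both results next to the original column b.
print(pd.DataFrame({'na=True': with_true, 'na=False': with_false, 'b': b}))
```

With na=False the separator rows (4 and 9) are not in the mask, so the forward fill carries the label into them and nothing clears it afterwards. With na=True they are cleared in the last step.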
Once you have mask, the idea is simple: copy the values from b into a wherever mask is True:
df.loc[mask, 'a'] = df['b']
Forward-fill the NaN values in a:
df['a'] = df['a'].ffill()
Replace the values in a with NaN wherever mask is True:
df.loc[mask, 'a'] = np.nan
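The three steps can also be chained into one expression. The sketch below is an alternative formulation, not the code from the answer above: it uses Series.where and Series.mask instead of the df.loc assignments, and it builds the sample data with np.nan directly rather than replacing the 'NAN' strings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'b': ['BABA UN EQUITY', '2018', '2017', '2016', np.nan,
                         '700 HK EQUITY', '2018', '2017', '2016', np.nan]})

# True on the EQUITY rows and on the NaN separator rows.
mask = df['b'].str.contains('EQUITY', na=True)

# where(mask): keep b only on the masked rows, NaN elsewhere
# ffill():     carry each kept value down to the rows below it
# mask(mask):  set the masked rows themselves back to NaN
df['a'] = df['b'].where(mask).ffill().mask(mask)

df = df[['a', 'b']]  # put the columns in the same order as the question
print(df)
```

Building a as a whole new column avoids assigning strings into an existing all-NaN column, which some pandas versions warn about when that column has a float dtype.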