Blu*_*482 2 python pandas data-cleaning
I have a DataFrame (df), usually loaded from an Excel file, whose first 9 rows look like this:
Control Recd_Date/Due_Date Action Signature/Requester
0 2000-1703 2000-01-31 00:00:00 OC/OER/OPA/PMS/ M WEBB
1 NaN 2000-02-29 00:00:00 NaN DATA CORP
2 2000-1776 2000-01-02 00:00:00 OC/ORA/OE/DCP/ G KAN
3 NaN 2000-01-03 00:00:00 OC/ORA/ORO/PNC/ PALM POST
4 NaN NaN FDA/OGROP/ORA/SE-FO/FLA- NaN
5 NaN NaN DO/FLA-CB/ NaN
6 2000-1983 2000-02-02 00:00:00 FDA/OGROP/ORA/CE-FO/CHI- M EGAN
7 NaN 2000-02-03 00:00:00 DO/CHI-CB/ BERNSTEIN LIEBHARD &
8 NaN NaN NaN LONDON LLP
I want to convert this DataFrame (for example, the first 9 rows) into this:
Control Recd_Date/Due_Date Action Signature/Requester
0 2000-1703 2000-01-31 00:00:00,2000-02-29 00:00:00 OC/OER/OPA/PMS/ M WEBB,DATA CORP
1 2000-1776 2000-01-02 00:00:00,2000-01-03 00:00:00 OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-FO/FLA-DO/FLA-CB/ G KAN,PALM POST
2 2000-1983 2000-02-02 00:00:00,2000-02-03 00:00:00 FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/ M EGAN,BERNSTEIN LIEBHARD & LONDON LLP
So basically:
Can someone help me? Here is the code I have been trying to get working:
for i, row in df.iterrows():
    if pd.isnull(df.ix[i]['Control_#']):
        df.ix[i-1]['Recd_Date/Due_Date'] = str(df.ix[i-1]['Recd_Date/Due_Date'])+'/'+str(df.ix[i]['Recd_Date/Due_Date'])
        df.ix[i-1]['Subject'] = str(df.ix[i-1]['Subject'])+' '+str(df.ix[i]['Subject'])
        if str(df.ix[i-1]['Action_Office'])[-1] == '-':
            df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office'])+str(df.ix[i]['Action_Office'])
        else:
            df.ix[i-1]['Action_Office'] = str(df.ix[i-1]['Action_Office'])+','+str(df.ix[i]['Action_Office'])
        if pd.isnull(df.ix[i-1]['Signature/Requester']):
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+str(df.ix[i]['Signature/Requester'])
        elif str(df.ix[i-1]['Signature/Requester'])[-1] == '&':
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+' '+str(df.ix[i]['Signature/Requester'])
        else:
            df.ix[i-1]['Signature/Requester'] = str(df.ix[i-1]['Signature/Requester'])+','+str(df.ix[i]['Signature/Requester'])
        df.drop(df.index[i])
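Why doesn't drop() work? I am trying to drop the current row if its ['Control_#'] is null, so that the next row (whose ['Control_#'] is also null) can be appended to the previous row (whose ['Control_#'] is NOT null) on the following iteration.
Much appreciated!!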
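I think you need to group the rows together and then merge the column values. The tricky part is finding a way to group the rows the way you want. Here is my solution...
Because your groups depend on the order of the rows, I used a static variable on a function to label each row with its group.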
import pandas as pd

def rolling_group(val):
    if pd.notnull(val):  # a non-null Control value starts a new group
        rolling_group.group += 1
    return rolling_group.group

rolling_group.group = 0  # function attribute acting as a static counter
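This function is applied to the Control Series to sort the index into groups, which are then used to split the DataFrame so that the rows can be merged.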
#groups = df.groupby(df['Control'].apply(rolling_group),as_index=False)
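As a possible alternative to the stateful counter (this is not part of the original answer), the same kind of group labels can usually be produced with a cumulative sum over a non-null mask. A minimal sketch, assuming the key column is named Control as in the sample; the small frame below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the sample data: a non-null Control starts a new record
df = pd.DataFrame({
    "Control": ["2000-1703", np.nan, "2000-1776", np.nan, np.nan],
    "Action": ["OC/OER/", np.nan, "OC/ORA/", "OC/ORO/", "DO/FLA-CB/"],
})

# True where a new record starts; the running sum labels each row with its record number
group_ids = df["Control"].notna().cumsum()
print(group_ids.tolist())  # [1, 1, 2, 2, 2]
```

Unlike rolling_group, this keeps no state between calls, so it can be rerun without resetting a counter.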
That is really the only tricky part. After that, you just apply a function to each group to merge its rows, which gives you the desired output.
import re

def joinFunc(g, column):
    col = g[column]
    joiner = "/" if column == "Action" else ","
    s = joiner.join([str(each) for each in col if pd.notnull(each)])
    s = re.sub("(?<=&)" + joiner, " ", s)  # after "&", join with a space instead
    s = re.sub("(?<=-)" + joiner, "", s)   # after "-", join with nothing
    s = re.sub(joiner * 2, joiner, s)      # collapse a doubled joiner
    return s
Edit: str(each) above converts each value to a string. Edit: the regular expressions above clean up the joined string.
import io

if __name__ == "__main__":
    data = """Control  Recd_Date/Due_Date  Action  Signature/Requester
0  2000-1703  2000-01-31 00:00:00  OC/OER/OPA/PMS/  M WEBB
1  NaN  2000-02-29 00:00:00  NaN  DATA CORP
2  2000-1776  2000-01-02 00:00:00  OC/ORA/OE/DCP/  G KAN
3  NaN  2000-01-03 00:00:00  OC/ORA/ORO/PNC/  PALM POST
4  NaN  NaN  FDA/OGROP/ORA/SE-FO/FLA-  NaN
5  NaN  NaN  DO/FLA-CB/  NaN
6  2000-1983  2000-02-02 00:00:00  FDA/OGROP/ORA/CE-FO/CHI-  M EGAN
7  NaN  2000-02-03 00:00:00  DO/CHI-CB/  BERNSTEIN LIEBHARD &
8  NaN  NaN  NaN  LONDON LLP"""
    # Columns are separated by two or more whitespace characters
    df = pd.read_csv(io.StringIO(data), sep=r"\s\s+", engine="python")
    groups = df.groupby(df['Control'].apply(rolling_group), as_index=False)
    groupFunct = lambda g: pd.Series([joinFunc(g, col) for col in g.columns], index=g.columns)
    print(groups.apply(groupFunct))
Output
Control Recd_Date/Due_Date \
0 2000-1703 2000-01-31 00:00:00,2000-02-29 00:00:00
1 2000-1776 2000-01-02 00:00:00,2000-01-03 00:00:00
2 2000-1983 2000-02-02 00:00:00,2000-02-03 00:00:00
Action \
0 OC/OER/OPA/PMS/
1 OC/ORA/OE/DCP/OC/ORA/ORO/PNC/FDA/OGROP/ORA/SE-...
2 FDA/OGROP/ORA/CE-FO/CHI-DO/CHI-CB/
Signature/Requester
0 M WEBB,DATA CORP
1 G KAN,PALM POST
2 M EGAN,BERNSTEIN LIEBHARD & LONDON LLP
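For readers who prefer a per-column aggregation, the same idea can be sketched with groupby(...).agg and one small join helper. This is a variant under stated assumptions rather than the answer's code: join_values and the three-row sample frame are hypothetical, and the joining rules are inferred from the desired output in the question.

```python
import numpy as np
import pandas as pd

def join_values(col, sep):
    """Join the non-null values of one group's column (hypothetical helper)."""
    out = ""
    for val in col.dropna().astype(str):
        if not out:
            out = val
        elif out.endswith("-") or (sep == "/" and out.endswith("/")):
            out += val            # continuation, or the separator is already there
        elif out.endswith("&"):
            out += " " + val      # a name that wrapped onto the next row
        else:
            out += sep + val
    return out

# Made-up sample: one record spread over three rows
df = pd.DataFrame({
    "Control": ["2000-1983", np.nan, np.nan],
    "Action": ["FDA/OGROP/ORA/CE-FO/CHI-", "DO/CHI-CB/", np.nan],
    "Signature/Requester": ["M EGAN", "BERNSTEIN LIEBHARD &", "LONDON LLP"],
})

# Label rows by record, then merge each column with its own rule
group_ids = df["Control"].notna().cumsum().rename("group")
merged = df.groupby(group_ids).agg({
    "Control": "first",
    "Action": lambda c: join_values(c, "/"),
    "Signature/Requester": lambda c: join_values(c, ","),
}).reset_index(drop=True)
print(merged)
```

Deciding the separator while joining avoids the regex clean-up pass, at the cost of a slightly longer helper.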
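On the question of why drop() appeared to do nothing: DataFrame.drop returns a new DataFrame by default and leaves the original untouched, so a bare df.drop(...) whose result is discarded has no visible effect. A minimal sketch with a made-up two-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, only to show how drop() behaves
df = pd.DataFrame({"Control": ["2000-1703", np.nan], "Action": ["OC/OER/", "DO/"]})

df.drop(df.index[1])               # result discarded, df is unchanged
rows_after_bare_drop = len(df)

df = df.drop(df.index[1])          # assign the result back to keep the change
rows_after_assignment = len(df)

print(rows_after_bare_drop, rows_after_assignment)  # 2 1
```

Note also that the .ix indexer used in the question was removed in pandas 1.0, so on current versions that loop would need .loc or .iloc before it could run at all.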