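Note: for simplicity I am using a toy example, because copying and pasting dataframes into Stack Overflow is difficult (please let me know if there is an easy way to do this).
Is there a way to merge the values from one dataframe into another without getting the _X, _Y columns? I want the values in one column to replace all the zero values in the other column.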
df1:
Name Nonprofit Business Education
X 1 1 0
Y 0 1 0 <- Y and Z have zero values for Nonprofit and Educ
Z 0 0 0
Y 0 1 0
df2:
Name Nonprofit Education
Y 1 1 <- this df has the correct values.
Z 1 1
pd.merge(df1, df2, on='Name', how='outer')
Name Nonprofit_X Business Education_X Nonprofit_Y Education_Y
Y 1 1 1 1 1
Y 1 1 1 1 1
X 1 1 0 nan nan
Z 1 1 1 1 1
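In a previous post I tried combine_first and dropna(), but neither did the job.
I want to replace the zeros in df1 with the values from df2. In addition, I want every row with the same name to be changed according to df2.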
Name Nonprofit Business Education
Y 1 1 1
Y 1 1 1
X 1 1 0
Z 1 0 1
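A minimal sketch of one way to produce this output without a merge, assuming each name appears at most once in df2. It looks up every shared column with Series.map and overwrites only the zeros in df1. The names lookup, result, mapped and needs_fix are my own and do not come from the post.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0], ["Y", 0, 1, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame(
    [["Y", 1, 1], ["Z", 1, 1]],
    columns=["Name", "Nonprofit", "Education"],
)

# Assumption: Name is unique in df2, so it can serve as a lookup index.
lookup = df2.set_index("Name")

result = df1.copy()
for col in lookup.columns:
    # Value from df2 for each row's Name; NaN where the name is absent from df2.
    mapped = result["Name"].map(lookup[col])
    # Overwrite only the zeros for which df2 offers a replacement.
    needs_fix = result[col].eq(0) & mapped.notna()
    result[col] = result[col].where(~needs_fix, mapped).astype(df1[col].dtype)

print(result)
```

Every row named Y is updated because the lookup is keyed on Name rather than on row position, and X is left alone because df2 has no entry for it.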
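My existing solution does the following: I subset on the names that exist in df2 and then replace those values with the correct ones. However, I would like a less hacky way to do this.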
pubunis_df = df2
sdf = df1
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.loc[pubunis.index, ['Education', 'Public']] = 1
searchnamesre(sdf, 'ORGS', regex)
Jer*_*y Z 63
KSD's answer raises an error:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1]],columns=["Name","Nonprofit", "Education"])
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
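The error appears to come from a row-count mismatch: the mask selects three rows of df1 (Y, Z, Y), while the right-hand side supplies only two. A small check that makes the two counts explicit, with the frames redefined so the snippet runs on its own; the variable names n_target and n_source are mine.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0], ["Y", 0, 1, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame(
    [["Y", 1, 1], ["Z", 1, 1]],
    columns=["Name", "Nonprofit", "Education"],
)
cols = ["Nonprofit", "Education"]

# Rows of df1 that the .loc assignment would overwrite.
mask = df1["Name"].isin(df2["Name"])
n_target = int(mask.sum())

# Rows offered by the bare NumPy array on the right-hand side.
n_source = df2[cols].to_numpy().shape[0]

print(n_target, n_source)
```

Because the array carries no labels, pandas can only match it by position, so the two counts have to agree.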
And EdChum's answer gives us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
Name Nonprofit Business Education
0 X 1.0 1 0.0
1 Y 1.0 1 1.0
2 Z NaN 0 NaN
3 Y NaN 1 NaN
It only works safely if the values in the "Name" column are unique and sorted in both dataframes.
Here is my answer:
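The likely reason is that assigning a DataFrame through .loc aligns on row index labels, not on Name. The sketch below makes that alignment visible with reindex, which is my stand-in for illustration and not pandas' actual code path; the names target_labels and aligned are also mine.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0], ["Y", 0, 1, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame(
    [["Y", 1, 1], ["Z", 1, 1]],
    columns=["Name", "Nonprofit", "Education"],
)
cols = ["Nonprofit", "Education"]

# Index labels of the df1 rows selected by the mask: 1, 2 and 3.
mask = df1["Name"].isin(df2["Name"])
target_labels = df1.index[mask]

# df2 only has labels 0 and 1, so labels 2 and 3 find nothing and become NaN.
aligned = df2[cols].reindex(target_labels)

print(aligned)
```

Label 1 in df1 is a Y row, but label 1 in df2 holds the values for Z, so even the row that does get filled is matched by position in the index and not by name.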
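# Only Nonprofit and Education exist in both frames, so only they get _x/_y suffixes.
df1 = df1.merge(df2, on='Name', how='left')
# Prefer the df2 value (_y); fall back to the df1 value (_x) where df2 has no match.
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(['Nonprofit_x', 'Education_x'], inplace=True, axis=1)
df1.rename(columns={'Nonprofit_y': 'Nonprofit', 'Education_y': 'Education'}, inplace=True)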
Alternatively, use update:
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)
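More guidance on update: the columns that are set as the index before calling update do not need to have the same name in the two dataframes. You could try "Name1" and "Name2". It also works when df2 contains extra rows that match nothing in df1; those rows simply do not update df1. In other words, the names in df2 do not have to line up exactly with the names in df1.
Example: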
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1],
['U',1,3]],columns=["Name2","Nonprofit", "Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
Result:
Nonprofit Business Education
Name1
X 1.0 1 0.0
Y 1.0 1 1.0
Z 1.0 0 1.0
Y 1.0 1 1.0
EdC*_*ica 24
Use the boolean mask from isin to filter df and assign the desired row values from the rhs df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
小智 8
In [27]: this is the correct version.
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
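The above works only when every row in df1 also exists in df. In other words, df should be a superset of df1.
If df1 contains rows that do not match any row in df (that is, df is not a superset of df1), do the following instead: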
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values
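Because .values discards the index, this assignment is purely positional. Below is a hedged sketch of a precondition check, using a helper name (lines_up_positionally) that I made up for illustration: the selected rows of the target must match the offered rows of the source one-to-one and in the same order.

```python
import pandas as pd


def lines_up_positionally(target, source, key="Name"):
    """Return True when a .values assignment would pair each row with its own name."""
    # Names of the target rows that the boolean mask would select, in order.
    selected = target.loc[target[key].isin(source[key]), key]
    # Names of the source rows that would be offered, in order.
    offered = source.loc[source[key].isin(target[key]), key]
    return selected.tolist() == offered.tolist()


df = pd.DataFrame({"Name": ["X", "Y", "Z", "Y"]})      # duplicate Y, as in the question
df1 = pd.DataFrame({"Name": ["Y", "Z"]})
unique_sorted = pd.DataFrame({"Name": ["X", "Y", "Z"]})

dup_ok = lines_up_positionally(df, df1)
unique_ok = lines_up_positionally(unique_sorted, df1)

print(dup_ok, unique_ok)
```

With the question's data the check fails, since three rows are selected but only two are offered; it passes once the names are unique and appear in the same order in both frames.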
df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()