In Pandas, how do I identify records that share a common value and replace the value in one to match the other?

Question

I have a pandas DataFrame with three columns:

a          b          c
Donaldson  Minnesota  2020
Ozuna      Atlanta    2020
Betts      Boston     2019
Donaldson  Atlanta    2019
Ozuna      St. Louis  2019
Torres     New York   2019

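I want to identify every name in column a that has more than one column c value, then replace all of that name's column b values with the first one that appears in the DataFrame, like this: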

a          b          c
Donaldson  Minnesota  2020
Ozuna      Atlanta    2020
Betts      Boston     2019
Donaldson  Minnesota  2019
Ozuna      Atlanta    2019
Torres     New York   2019

This is definitely inefficient, but here is what I have tried so far:

import pandas as pd

# get a df of just names and cities and deduplicate
df_names = df[['a', 'b']].drop_duplicates()

# find names in column a with multiple column b values and put them in a list
a_matches = pd.DataFrame(df_names.groupby('a')['b'].nunique())
multi_b = a_matches.index[a_matches['b'] > 1].tolist()

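This gives me ['Donaldson', 'Ozuna'], but now I'm stuck. I can't think of a good way to generate a replacement dictionary for the corresponding values in c. I think there must be a more elegant way to solve this.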

Answer 1

ank*_*_91 5

IIUC, you can try groupby + transform with np.where:

import numpy as np
import pandas as pd

g = df.groupby('a')
c = g['c'].transform('nunique').gt(1)  # column a names that have >1 column c value
df['b'] = np.where(c, g['b'].transform('first'), df['b'])
# for a new df: new = df.assign(b=np.where(c, g['b'].transform('first'), df['b']))

print(df)

           a          b     c
0  Donaldson  Minnesota  2020
1      Ozuna    Atlanta  2020
2      Betts     Boston  2019
3  Donaldson  Minnesota  2019
4      Ozuna    Atlanta  2019
5     Torres   New York  2019
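
To reproduce the result above end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the table in the question; the variable name multi_c is only an illustrative choice and does not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "a": ["Donaldson", "Ozuna", "Betts", "Donaldson", "Ozuna", "Torres"],
    "b": ["Minnesota", "Atlanta", "Boston", "Atlanta", "St. Louis", "New York"],
    "c": [2020, 2020, 2019, 2019, 2019, 2019],
})

g = df.groupby("a")
# True for rows whose column a name occurs with more than one distinct c value.
multi_c = g["c"].transform("nunique").gt(1)
# For those rows take the group's first b value; otherwise keep b unchanged.
df["b"] = np.where(multi_c, g["b"].transform("first"), df["b"])
```

Here transform returns a Series aligned with the original rows, which is why the boolean mask and the per-group first value can be passed straight to np.where.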

For the given example, as @ALloz correctly pointed out, you can use:

df['b'] = df.groupby('a')['b'].transform('first')
print(df)

           a          b     c
0  Donaldson  Minnesota  2020
1      Ozuna    Atlanta  2020
2      Betts     Boston  2019
3  Donaldson  Minnesota  2019
4      Ozuna    Atlanta  2019
5     Torres   New York  2019
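
The shorter version matches the conditional one on the question's data, but the two are not equivalent in general. The sketch below uses hypothetical data (not from the question) in which one name has two b values but only a single c value, so the nunique condition is False for that group.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Smith" has two b values but only one c value,
# while "Jones" has two b values and two c values.
df = pd.DataFrame({
    "a": ["Smith", "Smith", "Jones", "Jones"],
    "b": ["Denver", "Miami", "Tampa", "Seattle"],
    "c": [2019, 2019, 2019, 2020],
})

g = df.groupby("a")
first_b = g["b"].transform("first")
multi_c = g["c"].transform("nunique").gt(1)

# Conditional version: only groups with more than one c value are overwritten.
conditional = pd.Series(np.where(multi_c, first_b, df["b"]), index=df.index)
# Unconditional version: every group gets its first b value.
unconditional = first_b
```

With this data the conditional version leaves the Smith rows as Denver and Miami, while the unconditional version sets both to Denver. Which behavior is wanted depends on how strictly the "more than one column c value" requirement should be read.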

Is `np.where` much faster than `series.mask` or `series.where`? (2 upvotes)
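
For anyone who wants to compare, here is a rough sketch of the same replacement written with `Series.mask` and `Series.where`, using the question's sample data. It only shows that the three forms produce the same values; it makes no claim about speed, which likely depends on the data size and pandas version and is best measured with timeit on your own data. The names via_np_where, via_mask, and via_where are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "a": ["Donaldson", "Ozuna", "Betts", "Donaldson", "Ozuna", "Torres"],
    "b": ["Minnesota", "Atlanta", "Boston", "Atlanta", "St. Louis", "New York"],
    "c": [2020, 2020, 2019, 2019, 2019, 2019],
})

g = df.groupby("a")
multi_c = g["c"].transform("nunique").gt(1)
first_b = g["b"].transform("first")

# np.where returns a NumPy array, so it is wrapped in a Series for comparison.
via_np_where = pd.Series(np.where(multi_c, first_b, df["b"]), index=df.index)
# mask replaces values where the condition is True.
via_mask = df["b"].mask(multi_c, first_b)
# where keeps values where the condition is True, so the condition is inverted.
via_where = df["b"].where(~multi_c, first_b)
```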
