Replace duplicate rows' column values with the first unique value and create a lookup

Viv*_*gan 5 python pandas

Here is the data:

Account_Number  Dummy_Account
1050080713252   ACC0000000000001
1050223213427   ACC0000000000002
1050080713252   ACC0000000169532
1105113502309   ACC0000000123005
1100043521537   ACC0000000000004
1100045301840   ACC0000000000005
1105113502309   ACC0000000000040

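Rows 1 and 3 have duplicate values in Account_Number. The same goes for rows 4 and 7. I need to replace the duplicated Account_Number values with the same Dummy_Account value. So for 1050080713252, rows 1 and 3 should both get the same dummy value, ACC0000000000001. But instead of replacing the values directly, I want to keep the original mapping.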

My expected output is:

Account_Number_Map      Dummy_Account_Original
ACC0000000000001    ACC0000000000001
ACC0000000000002    ACC0000000000002
ACC0000000000001    ACC0000000169532
ACC0000000123005    ACC0000000123005
ACC0000000000004    ACC0000000000004
ACC0000000000005    ACC0000000000005
ACC0000000123005    ACC0000000000040

Since ACC0000000169532 is a duplicate Dummy_Account with respect to Account_Number, I want to create a lookup that replaces it with ACC0000000000001.

What I have tried

I started by creating a dict like this:

maps = dict(zip(df.Dummy_Account, df.Account_Number))

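I want to create a dict with the original Dummy_Account value as the key and the new Dummy_Account value as the value, but I am a bit lost. My dataset is large, so I am also looking for an optimized solution.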

cs9*_*s95 3

Use drop_duplicates to create a Series that you then pass to map:

m = df.drop_duplicates('Account_Number', keep='first')\
      .set_index('Account_Number')\
      .Dummy_Account

df.Account_Number = df.Account_Number.map(m)

df

     Account_Number     Dummy_Account
0  ACC0000000000001  ACC0000000000001
1  ACC0000000000002  ACC0000000000002
2  ACC0000000000001  ACC0000000169532
3  ACC0000000123005  ACC0000000123005
4  ACC0000000000004  ACC0000000000004
5  ACC0000000000005  ACC0000000000005
6  ACC0000000123005  ACC0000000000040
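
The question also asks for a dict keyed by the original Dummy_Account. A minimal sketch of one way to get it with the same drop_duplicates + map idea is below. The names `first_dummy`, `result` and `lookup` are illustrative and not part of the original answer. The sketch writes the mapped values to a new frame instead of overwriting Account_Number, and it assumes each Dummy_Account value appears only once (otherwise later rows overwrite earlier keys in the dict).

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Account_Number": [
        1050080713252, 1050223213427, 1050080713252, 1105113502309,
        1100043521537, 1100045301840, 1105113502309,
    ],
    "Dummy_Account": [
        "ACC0000000000001", "ACC0000000000002", "ACC0000000169532",
        "ACC0000000123005", "ACC0000000000004", "ACC0000000000005",
        "ACC0000000000040",
    ],
})

# Series indexed by Account_Number holding the first Dummy_Account seen for it.
first_dummy = (
    df.drop_duplicates("Account_Number", keep="first")
      .set_index("Account_Number")["Dummy_Account"]
)

# Build the result in a new frame so the original df is left untouched.
# Column names follow the expected output shown in the question.
result = pd.DataFrame({
    "Account_Number_Map": df["Account_Number"].map(first_dummy),
    "Dummy_Account_Original": df["Dummy_Account"],
})

# Lookup dict: original Dummy_Account -> first Dummy_Account of the same account.
# Assumes Dummy_Account values are unique, as they are in the sample data.
lookup = dict(zip(result["Dummy_Account_Original"], result["Account_Number_Map"]))
```

With the sample data, `lookup["ACC0000000169532"]` gives `"ACC0000000000001"`, and values that are already the first for their account map to themselves.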

Timings

df = pd.concat([df] * 1000000, ignore_index=True)

# jezrael's solution

%%timeit
v = df.sort_values('Account_Number')
v['Account_Number'] = v['Dummy_Account'].mask(v.duplicated('Account_Number')).ffill()
v.sort_index()

315 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# in this post

%%timeit
m = df.drop_duplicates('Account_Number', keep='first')\
      .set_index('Account_Number')\
      .Dummy_Account

df.Account_Number.map(m)

163 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that performance will depend on your actual data.
