Here is the data:
Account_Number Dummy_Account
1050080713252 ACC0000000000001
1050223213427 ACC0000000000002
1050080713252 ACC0000000169532
1105113502309 ACC0000000123005
1100043521537 ACC0000000000004
1100045301840 ACC0000000000005
1105113502309 ACC0000000000040
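Rows 1 and 3 have the same Account_Number value, and so do rows 4 and 7. I need to replace Account_Number with Dummy_Account, so for 1050080713252 both rows 1 and 3 should get the same dummy value, ACC0000000000001. Rather than replacing the values in place, though, I want to keep the original mapping.
My expected output is: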
Account_Number_Map Dummy_Account_Original
ACC0000000000001 ACC0000000000001
ACC0000000000002 ACC0000000000002
ACC0000000000001 ACC0000000169532
ACC0000000123005 ACC0000000123005
ACC0000000000004 ACC0000000000004
ACC0000000000005 ACC0000000000005
ACC0000000123005 ACC0000000000040
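Since ACC0000000169532 is a duplicate Dummy_Account with respect to Account_Number, I want to create a lookup that replaces it with ACC0000000000001.
What I have tried
I started by creating a dict like this: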
maps = dict(zip(df.Dummy_Account, df.Account_Number))
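I want to create a dict with the original Dummy_Account value as the key and the new Dummy_Account value as the value,
but I am a bit lost. My dataset is large, so I am also looking for an optimized solution.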
Use drop_duplicates to build a Series that keeps the first Dummy_Account for each Account_Number, then pass that Series to map:

m = df.drop_duplicates('Account_Number', keep='first')\
      .set_index('Account_Number')\
      .Dummy_Account

df.Account_Number = df.Account_Number.map(m)

df

     Account_Number     Dummy_Account
0  ACC0000000000001  ACC0000000000001
1  ACC0000000000002  ACC0000000000002
2  ACC0000000000001  ACC0000000169532
3  ACC0000000123005  ACC0000000123005
4  ACC0000000000004  ACC0000000000004
5  ACC0000000000005  ACC0000000000005
6  ACC0000000123005  ACC0000000000040

Timings

df = pd.concat([df] * 1000000, ignore_index=True)

# jezrael's solution

%%timeit
v = df.sort_values('Account_Number')
v['Account_Number'] = v['Dummy_Account'].mask(v.duplicated('Account_Number')).ffill()
v.sort_index()

315 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# in this post

%%timeit
m = df.drop_duplicates('Account_Number', keep='first')\
      .set_index('Account_Number')\
      .Dummy_Account

df.Account_Number.map(m)

163 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that performance will depend on your actual data.
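The question also asks for a dict keyed by the original Dummy_Account value. A minimal sketch of one way to get it from the same drop_duplicates/map idea is below. The names first_dummy, canonical, lookup and result are made up for illustration, and the dict only makes sense if every Dummy_Account value appears once in the data, which holds for the sample but is an assumption for the real dataset.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Account_Number": [
        1050080713252, 1050223213427, 1050080713252, 1105113502309,
        1100043521537, 1100045301840, 1105113502309,
    ],
    "Dummy_Account": [
        "ACC0000000000001", "ACC0000000000002", "ACC0000000169532",
        "ACC0000000123005", "ACC0000000000004", "ACC0000000000005",
        "ACC0000000000040",
    ],
})

# First Dummy_Account seen for each Account_Number (the Series called `m` in the answer).
first_dummy = (
    df.drop_duplicates("Account_Number", keep="first")
      .set_index("Account_Number")["Dummy_Account"]
)

# Canonical dummy value for every row, aligned with df's index.
canonical = df["Account_Number"].map(first_dummy)

# Lookup: original Dummy_Account -> canonical Dummy_Account.
# Assumes Dummy_Account values are unique; later duplicates would overwrite earlier ones.
lookup = dict(zip(df["Dummy_Account"], canonical))

# Keep both columns side by side instead of overwriting Account_Number.
result = pd.DataFrame({
    "Account_Number_Map": canonical,
    "Dummy_Account_Original": df["Dummy_Account"],
})
```

Building result as a separate frame leaves df untouched, so the original Account_Number column is still available if the mapping needs to be audited later.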