Add a df column by looking up matching values in another df for an index value and a dynamic source column?

Chr*_*xon 9 python dataframe pandas

Simplified dfs:

import pandas as pd

df = pd.DataFrame(
    {
        "ID": [6, 2, 4],
        "to ignore": ["foo", "whatever", "idk"],
        "value": ["A", "B", "A"],
    }
)

df2 = pd.DataFrame(
    {
        "ID_number": [1, 2, 3, 4, 5, 6],
        "A": [0.91, 0.42, 0.85, 0.84, 0.81, 0.88],
        "B": [0.11, 0.22, 0.45, 0.38, 0.01, 0.18],
    }
)

   ID to ignore value
0   6       foo     A
1   2  whatever     B
2   4       idk     A

      A     B  ID_number
0  0.91  0.11          1
1  0.42  0.22          2
2  0.85  0.45          3
3  0.84  0.38          4
4  0.81  0.01          5
5  0.88  0.18          6

I want to add a column to df that contains the value from df2 found by matching df['ID'] to df2['ID_number'] and reading the df2 column whose name is given in df['value'] ("A" or "B").

We can add a column of matching values when the name of the lookup column in df2 is fixed, here "A":

df["NewCol"] = df["ID"].map(
    df2.drop_duplicates("ID_number").set_index("ID_number")["A"]
)

This gives:

   ID to ignore value  NewCol
0   6       foo     A    0.88
1   2  whatever     B    0.42
2   4       idk     A    0.84

But this does not pick up the values from column B: the row whose value is 'B' shows '0.42' above, when it should show '0.22'.

df["NewCol"] = df["ID"].map(
    df2.drop_duplicates("ID_number").set_index("ID_number")[df["value"]]
)

obviously doesn't work. How can I do this?

Ch3*_*teR 4

You can set ID_number as the index of df2 and then use pd.Index.get_indexer here.

df2 = df2.set_index('ID_number')
r = df2.index.get_indexer(df['ID'])       # row position of each ID in df2
c = df2.columns.get_indexer(df['value'])  # column position of each 'A'/'B' label
df['new_col'] = df2.values[r, c]          # pick one (row, column) cell per row of df
df

   ID to ignore value  new_col
0   6       foo     A     0.88
1   2  whatever     B     0.22
2   4       idk     A     0.84
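
One caveat with this approach: `Index.get_indexer` returns -1 for a label it cannot find, and NumPy reads -1 as "last element", so an ID that is missing from df2 would silently receive a value from the last row. Below is a minimal sketch of a guarded variant, assuming the same column names as the question. The helper name `lookup_values` and the extra ID 99 are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def lookup_values(df, table, row_key="ID", col_key="value"):
    """Pick table[row, col] for every row of df; NaN where a label is missing.

    Hypothetical helper. `table` must have unique index and column labels,
    because get_indexer only works on unique indexes.
    """
    r = table.index.get_indexer(df[row_key])
    c = table.columns.get_indexer(df[col_key])
    found = (r >= 0) & (c >= 0)  # -1 marks a label that was not found
    out = np.full(len(df), np.nan)
    out[found] = table.to_numpy(dtype=float)[r[found], c[found]]
    return out


# Same data as the question, plus one ID (99) that does not exist in df2.
df = pd.DataFrame({"ID": [6, 2, 4, 99], "value": ["A", "B", "A", "B"]})
df2 = pd.DataFrame(
    {
        "ID_number": [1, 2, 3, 4, 5, 6],
        "A": [0.91, 0.42, 0.85, 0.84, 0.81, 0.88],
        "B": [0.11, 0.22, 0.45, 0.38, 0.01, 0.18],
    }
)

df["new_col"] = lookup_values(df, df2.set_index("ID_number"))
print(df)
```

With this guard the row for ID 99 ends up as NaN instead of borrowing the value stored for ID 6.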

Timings

Benchmarked using the following setup:

Tested on Ubuntu 20.04.1 LTS (focal), CPython 3.8.5, IPython shell (7.18.1), pandas (1.1.4), numpy (1.19.2).

Setup

import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "ID_number": np.arange(1, 1_000_000 + 1),
        "A": np.random.rand(1_000_000),
        "B": np.random.rand(1_000_000),
    }
)

df = pd.DataFrame(
    {
        "ID": np.random.randint(1, 1_000_000, 50_000),
        "to ignore": ["anything"] * 50_000,
        "value": np.random.choice(["A", "B"], 50_000),
    }
)

Results:

@Vaishali
In [57]: %%timeit
    ...: mapper = df2.set_index('ID_number').to_dict('index')
    ...: df['NewCol'] = df.apply(lambda x: mapper[x['ID']][x['value']], axis=1)
    ...:
2.09 s ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Ch3steR
In [58]: %%timeit
    ...: t = df2.set_index('ID_number')
    ...: r = t.index.get_indexer(df['ID'])
    ...: c = t.columns.get_indexer(df['value'])
    ...: df['new_col'] = df2.values[r, c]
    ...:
49.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Mayank
In [59]: %%timeit
    ...: x = df2.set_index('ID_number').stack()
    ...: y = df.set_index(['ID', 'value'])
    ...: y['NewCol'] = y.index.to_series().map(x.to_dict())
    ...: y.reset_index(inplace=True)
    ...:
3.41 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Jezrael
In [60]: %%timeit
    ...: df11 = (df2.melt('ID_number', value_name='NewCol', var_name='value')
    ...:            .drop_duplicates(['ID_number','value'])
    ...:            .rename(columns={'ID_number':'ID'}))
    ...: df.merge(df11, on=['ID','value'], how='left')
    ...:
693 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
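
As a sanity check that the timed approaches are interchangeable, here is a minimal sketch that runs three of them (dict plus apply, get_indexer, and melt plus merge) on the small frames from the question and compares the results. The variable names `via_apply`, `via_indexer` and `via_merge` are illustrative only, and `drop_duplicates` is left out on the assumption that the IDs are already unique, as they are in the sample data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [6, 2, 4],
        "to ignore": ["foo", "whatever", "idk"],
        "value": ["A", "B", "A"],
    }
)
df2 = pd.DataFrame(
    {
        "ID_number": [1, 2, 3, 4, 5, 6],
        "A": [0.91, 0.42, 0.85, 0.84, 0.81, 0.88],
        "B": [0.11, 0.22, 0.45, 0.38, 0.01, 0.18],
    }
)
t = df2.set_index("ID_number")

# 1) Nested dict lookup, one Python-level call per row.
mapper = t.to_dict("index")
via_apply = df.apply(lambda x: mapper[x["ID"]][x["value"]], axis=1).tolist()

# 2) Positional lookup with get_indexer and NumPy integer indexing.
r = t.index.get_indexer(df["ID"])
c = t.columns.get_indexer(df["value"])
via_indexer = t.to_numpy()[r, c].tolist()

# 3) Reshape df2 to long format, then left-merge on both keys.
long = df2.melt(id_vars="ID_number", var_name="value", value_name="NewCol").rename(
    columns={"ID_number": "ID"}
)
via_merge = df.merge(long, on=["ID", "value"], how="left")["NewCol"].tolist()

print(via_apply, via_indexer, via_merge)
```

All three lists should contain the same values in the same order as the rows of df.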