Chr*_*xon 9 python dataframe pandas
Simplified dfs:
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [6, 2, 4],
        "to ignore": ["foo", "whatever", "idk"],
        "value": ["A", "B", "A"],
    }
)
df2 = pd.DataFrame(
    {
        "ID_number": [1, 2, 3, 4, 5, 6],
        "A": [0.91, 0.42, 0.85, 0.84, 0.81, 0.88],
        "B": [0.11, 0.22, 0.45, 0.38, 0.01, 0.18],
    }
)
   ID  to ignore value
0   6        foo     A
1   2   whatever     B
2   4        idk     A

      A     B  ID_number
0  0.91  0.11          1
1  0.42  0.22          2
2  0.85  0.45          3
3  0.84  0.38          4
4  0.81  0.01          5
5  0.88  0.18          6
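I want to add a column to df that holds, for each row, the value from df2 found by matching df['ID'] against df2['ID_number'] (the row) and df['value'] against the df2 column of the same name ("A" or "B").
We can add a column of matched values when the lookup column in df2 is fixed, for example "A":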
df["NewCol"] = df["ID"].map(
    df2.drop_duplicates("ID_number").set_index("ID_number")["A"]
)
This gives:
   ID  to ignore value  NewCol
0   6        foo     A    0.88
1   2   whatever     B    0.42
2   4        idk     A    0.84
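But this does not use column B for the rows whose value is "B": the 0.42 above should be 0.22, since that row asks for 'B'. Passing df["value"] as the column selector instead: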
df["NewCol"] = df["ID"].map(
    df2.drop_duplicates("ID_number").set_index("ID_number")[df["value"]]
)
obviously does not work. How can I do this?
You can set ID_number as the index of df2 and then use pd.Index.get_indexer to turn the row and column labels into integer positions.
df2 = df2.set_index('ID_number')
r = df2.index.get_indexer(df['ID'])
c = df2.columns.get_indexer(df['value'])
df['new_col'] = df2.values[r, c]
df

   ID  to ignore value  new_col
0   6        foo     A     0.88
1   2   whatever     B     0.22
2   4        idk     A     0.84

Benchmarking with the following setup:

Tested on Ubuntu 20.04.1 LTS (focal), CPython 3.8.5, IPython shell (7.18.1), pandas (1.1.4), numpy (1.19.2)

Setup

import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "ID_number": np.arange(1, 1_000_000 + 1),
        "A": np.random.rand(1_000_000),
        "B": np.random.rand(1_000_000),
    }
)

df = pd.DataFrame(
    {
        "ID": np.random.randint(1, 1_000_000, 50_000),
        "to ignore": ["anything"] * 50_000,
        "value": np.random.choice(["A", "B"], 50_000),
    }
)

Results:

@Vaishali
In [57]: %%timeit
    ...: mapper = df2.set_index('ID_number').to_dict('index')
    ...: df['NewCol'] = df.apply(lambda x: mapper[x['ID']][x['value']], axis=1)
    ...:
2.09 s ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Ch3steR
In [58]: %%timeit
    ...: t = df2.set_index('ID_number')
    ...: r = t.index.get_indexer(df['ID'])
    ...: c = t.columns.get_indexer(df['value'])
    ...: df['new_col'] = t.values[r, c]
    ...:
49.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Mayank
In [59]: %%timeit
    ...: x = df2.set_index('ID_number').stack()
    ...: y = df.set_index(['ID', 'value'])
    ...: y['NewCol'] = y.index.to_series().map(x.to_dict())
    ...: y.reset_index(inplace=True)
    ...:
3.41 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@Jezrael
In [60]: %%timeit
    ...: df11 = (df2.melt('ID_number', value_name='NewCol', var_name='value')
    ...:         .drop_duplicates(['ID_number','value'])
    ...:         .rename(columns={'ID_number':'ID'}))
    ...: df.merge(df11, on=['ID','value'], how='left')
    ...:
693 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
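
One caveat with the get_indexer approach: get_indexer returns -1 for any label it cannot find, and NumPy reads -1 as "last row" or "last column", so an ID that is missing from df2 would silently pick up a wrong value. Below is a minimal sketch of one way to guard against that, assuming unmatched rows should become NaN. The extra ID 99 and the names lookup and found are made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [6, 2, 4, 99],  # 99 is a made-up ID that is absent from df2
        "value": ["A", "B", "A", "B"],
    }
)
df2 = pd.DataFrame(
    {
        "ID_number": [1, 2, 3, 4, 5, 6],
        "A": [0.91, 0.42, 0.85, 0.84, 0.81, 0.88],
        "B": [0.11, 0.22, 0.45, 0.38, 0.01, 0.18],
    }
)

lookup = df2.set_index("ID_number")
r = lookup.index.get_indexer(df["ID"])       # row positions, -1 if the ID is not found
c = lookup.columns.get_indexer(df["value"])  # column positions, -1 if the name is not found
found = (r >= 0) & (c >= 0)

# Start from NaN and fill only the rows where both lookups succeeded.
new_col = np.full(len(df), np.nan)
new_col[found] = lookup.to_numpy()[r[found], c[found]]
df["new_col"] = new_col
```

With this guard the row with ID 99 gets NaN rather than a value taken from the last row of df2. Note that get_indexer also needs a unique index, so if df2 can contain repeated ID_number values, drop the duplicates first (as the question does with drop_duplicates).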