Add IDs found in a list to a new column in a Pandas dataframe

Question

MDR*_*MDR 12 python dataframe python-3.x pandas

Suppose I have the following dataframe (one column of integers and one column of lists of integers)...

      ID                   Found_IDs
0  12345        [15443, 15533, 3433]
1  15533  [2234, 16608, 12002, 7654]
2   6789      [43322, 876544, 36789]

There is also a separate list of IDs...

bad_ids = [15533, 876544, 36789, 11111]

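Given that, and ignoring the df['ID'] column and any index, I want to see whether any of the IDs in the bad_ids list are mentioned in the df['Found_IDs'] column. The code I have so far is: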

df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

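This works, but only if the bad_ids list is longer than the dataframe, and for the real dataset the bad_ids list will be much shorter than the dataframe. If I set the bad_ids list to only two elements...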

bad_ids = [15533, 876544]

I get a very common error (I have read many questions with the same error)...

ValueError: Length of values does not match length of index

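I tried converting the list to a Series (no change in the error). I also tried adding the new column and setting all its values to False before running the comprehension line (again, no change in the error).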
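
A note on where the error comes from: zip stops at its shortest input, so with two bad_ids and three rows the comprehension yields only two values, and pandas will not assign two values to a three-row column. It also means the four-element version only compares bad_ids[i] against row i, rather than checking every ID against every row. A minimal sketch of this, reusing the data from the question (the names flags and raised are for illustration only):

```python
import pandas as pd

df = pd.DataFrame(
    [[12345, [15443, 15533, 3433]],
     [15533, [2234, 16608, 12002, 7654]],
     [6789, [43322, 876544, 36789]]],
    columns=['ID', 'Found_IDs'],
)
bad_ids = [15533, 876544]

# zip pairs bad_ids[i] with row i and stops at the shorter input,
# so only two values are produced for a three-row dataframe.
flags = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]
print(len(flags), len(df))  # 2 3

# Assigning a list whose length differs from the index raises ValueError.
try:
    df['bad_id'] = flags
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```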

Two questions:

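How do I get my code (below) to work for a list that is shorter than the dataframe?
How do I get the code to write the actual ID that was found back to the df['bad_id'] column (more useful than True/False)?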

Expected output for bad_ids = [15533, 876544]:

      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    True
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    True

Ideal output for bad_ids = [15533, 876544] (writing the IDs to one or more new columns):

      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]   15533
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]  876544

Code:

import pandas as pd

result_list = [[12345,[15443,15533,3433]],
        [15533,[2234,16608,12002,7654]],
        [6789,[43322,876544,36789]]]

df = pd.DataFrame(result_list,columns=['ID','Found_IDs'])

# works if list has four elements
# bad_ids = [15533, 876544, 36789, 11111]

# fails if list has two elements (fewer elements than the dataframe has rows)
# ValueError: Length of values does not match length of index
bad_ids = [15533, 876544]

# converting to Series doesn't change things
# bad_ids = pd.Series(bad_ids)
# print(type(bad_ids))

# setting up a new column of false values doesn't change things
# df['bad_id'] = False

print(df)

df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

print(bad_ids)

print(df)

Answer 1

Erf*_*fan 9

Use np.intersect1d to get the intersection of the two lists:

import numpy as np

df['bad_id'] = df['Found_IDs'].apply(lambda x: np.intersect1d(x, bad_ids))

      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]
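
If the True/False column from the question is also wanted, it can be derived from the length of each intersection. The following is a sketch rather than part of the original answer: the names matches and has_bad_id are made up for illustration, and the arrays are converted to plain lists only to make the column easier to read. Note that np.intersect1d returns sorted, unique values, so the original order within Found_IDs is not kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[12345, [15443, 15533, 3433]],
     [15533, [2234, 16608, 12002, 7654]],
     [6789, [43322, 876544, 36789]]],
    columns=['ID', 'Found_IDs'],
)
bad_ids = [15533, 876544]

# One NumPy array of shared IDs per row (empty when nothing matches).
matches = df['Found_IDs'].apply(lambda x: np.intersect1d(x, bad_ids))

# Store plain Python lists, plus a boolean flag for "at least one match".
df['bad_id'] = matches.apply(lambda a: a.tolist())
df['has_bad_id'] = matches.apply(len) > 0

print(df)
```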

Or use vanilla Python with set intersection:

bad_ids_set = set(bad_ids)
df['bad_id'] = df['Found_IDs'].apply(lambda x: list(set(x) & bad_ids_set))
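
To get closer to the two outputs shown in the question (a True/False flag, or a single ID per row), the same set can be reused. This is only a sketch under some assumptions: first_bad_id is a hypothetical helper, it returns just the first match in each row, and it uses False as the "nothing found" value because that is what the question's table shows (None or NaN may be a cleaner choice in real data).

```python
import pandas as pd

df = pd.DataFrame(
    [[12345, [15443, 15533, 3433]],
     [15533, [2234, 16608, 12002, 7654]],
     [6789, [43322, 876544, 36789]]],
    columns=['ID', 'Found_IDs'],
)
bad_ids = [15533, 876544]
bad_ids_set = set(bad_ids)


def first_bad_id(found, bad=bad_ids_set, default=False):
    """Return the first element of `found` that is in `bad`, else `default`."""
    return next((i for i in found if i in bad), default)


# True when the row shares at least one ID with bad_ids_set.
df['has_bad_id'] = df['Found_IDs'].apply(lambda x: not bad_ids_set.isdisjoint(x))

# The first matching ID per row, or False when there is none.
df['bad_id'] = df['Found_IDs'].apply(first_bad_id)

print(df)
```

Because every row is checked against the whole set, the length of bad_ids no longer has to match the number of rows.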
