Add IDs found in a list to a new column in a Pandas DataFrame

MDR*_*MDR 12 python dataframe python-3.x pandas

Say I have the following dataframe (a column of integers and a column of lists of integers)...

      ID                   Found_IDs
0  12345        [15443, 15533, 3433]
1  15533  [2234, 16608, 12002, 7654]
2   6789      [43322, 876544, 36789]

And a separate list of IDs...

bad_ids = [15533, 876544, 36789, 11111]

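Given that, and ignoring the df['ID'] column and any index, I want to see whether any of the IDs in the bad_ids list are mentioned in the df['Found_IDs'] column. The code I have so far is: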

df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

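This works, but only if the bad_ids list is longer than the dataframe, and for the real dataset the bad_ids list will be much shorter than the dataframe. If I set the bad_ids list to only two elements...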

bad_ids = [15533, 876544]

I get a very common error (I have read many questions with the same error)...

ValueError: Length of values does not match length of index

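I have tried converting the list to a Series (no change in the error). I have also tried adding the new column and setting all values to False before running the comprehension line (again, no change in the error).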

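Two questions:

  1. How do I get my code (below) to work for a list that is shorter than the dataframe?
  2. How do I get the code to write the actual ID found back to the df['bad_id'] column (more useful than True/False)?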

Expected output for bad_ids = [15533, 876544]:

      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    True
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    True

Ideal output for bad_ids = [15533, 876544] (the ID or IDs written to one or more new columns):

      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    15533
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    876544

Code:

import pandas as pd

result_list = [[12345,[15443,15533,3433]],
        [15533,[2234,16608,12002,7654]],
        [6789,[43322,876544,36789]]]

df = pd.DataFrame(result_list,columns=['ID','Found_IDs'])

# works if list has four elements
# bad_ids = [15533, 876544, 36789, 11111]

# fails if list has two elements (fewer elements than the dataframe)
# ValueError: Length of values does not match length of index
bad_ids = [15533, 876544]

# converting to Series doesn't change things
# bad_ids = pd.Series(bad_ids)
# print(type(bad_ids))

# setting up a new column of false values doesn't change things
# df['bad_id'] = False

print(df)

df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

print(bad_ids)

print(df)

Erf*_*fan 9

Use np.intersect1d to get the intersection of the two lists:

import numpy as np

df['bad_id'] = df['Found_IDs'].apply(lambda x: np.intersect1d(x, bad_ids))

      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]
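
On the first question: the original comprehension fails because zip stops at the shorter of its two inputs, so a two-element bad_ids yields only two values for a three-row dataframe. A minimal sketch of a per-row membership test that does not depend on the length of bad_ids might look like the following (bad_ids_set is a name introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    [[12345, [15443, 15533, 3433]],
     [15533, [2234, 16608, 12002, 7654]],
     [6789, [43322, 876544, 36789]]],
    columns=['ID', 'Found_IDs'],
)
bad_ids = [15533, 876544]

# A set gives average constant-time membership tests.
bad_ids_set = set(bad_ids)

# One boolean per row: True if any ID in that row's list is a bad ID.
df['bad_id'] = [any(i in bad_ids_set for i in ids) for ids in df['Found_IDs']]
```

Because the comprehension iterates over df['Found_IDs'] only, it always produces exactly one value per row.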

Or use plain Python with set intersection:

bad_ids_set = set(bad_ids)
df['bad_id'] = df['Found_IDs'].apply(lambda x: list(set(x) & bad_ids_set))
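
To get the layout of the question's ideal output (a single ID, or False when nothing matches), one option is to take the first match in each row. This is only a sketch, assuming that the first matching ID in list order is the one wanted; first_bad_id is a hypothetical helper name, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame(
    [[12345, [15443, 15533, 3433]],
     [15533, [2234, 16608, 12002, 7654]],
     [6789, [43322, 876544, 36789]]],
    columns=['ID', 'Found_IDs'],
)
bad_ids_set = {15533, 876544}

def first_bad_id(ids):
    """Return the first ID in ids that is in bad_ids_set, else False."""
    return next((i for i in ids if i in bad_ids_set), False)

df['bad_id'] = df['Found_IDs'].apply(first_bad_id)
```

If a row can contain several bad IDs and all of them matter, the list-valued results above are probably the safer choice, since this version silently drops every match after the first.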