Update DataFrame values based on a list

Question

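I have a DataFrame and, based on the string in a column named "originator", I want to check whether that string contains any word from a separate list. If it does, the originator_prediction column should be updated to "org".

Is there a better way to do this? I am currently doing it as follows, but it is slow.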

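# Slow row-by-row approach: check every word of every ORIGINATOR value
for idx, value in df['ORIGINATOR'].items():
    for word in str(value).split():
        if word in COMMON_ORG_UNIGRAMS_LIST:
            # Update only the matching row, not the whole column
            df.loc[idx, 'ORGINATOR_PREDICTION'] = 'Org'
            break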


df  = pd.DataFrame({'ORIGINATOR':  ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
        'ORGINATOR_PREDICTION': ['Person', 'Person','Person']})

COMMON_ORG_UNIGRAMS_LIST = ['INC','LLC','LP']

Specifically, look at row 2 of our DataFrame, "APPLE INC": it should have originator_prediction = 'ORG' rather than person.

The reason is that the list of common org unigrams contains the word INC.

Answer 1

Sco*_*ton 2

Try the .str string accessor with its contains method. Joining the list of strings with '|' builds a regular expression that matches any of them:

df.loc[df['ORIGINATOR'].str.contains('|'.join(COMMON_ORG_UNIGRAMS_LIST)), 'ORGINATOR_PREDICTION'] = 'Org'

Output:

   ORIGINATOR ORGINATOR_PREDICTION
0    JOHN DOE               Person
1   APPLE INC                  Org
2  MIKE LOWRY               Person
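
To make the one-liner easier to follow, here is a minimal sketch that splits it into its intermediate steps. The names `pattern` and `mask` are introduced only for illustration, and `na=False` is an added assumption so that missing values count as "no match" rather than raising an error during indexing.

```python
import pandas as pd

df = pd.DataFrame({
    'ORIGINATOR': ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
    'ORGINATOR_PREDICTION': ['Person', 'Person', 'Person'],
})
COMMON_ORG_UNIGRAMS_LIST = ['INC', 'LLC', 'LP']

# Joining with '|' yields a regex alternation: match INC or LLC or LP.
pattern = '|'.join(COMMON_ORG_UNIGRAMS_LIST)

# str.contains returns one boolean per row.
# na=False (an assumption, not in the answer) treats missing values as no match.
mask = df['ORIGINATOR'].str.contains(pattern, na=False)

# Assign only to the rows where the mask is True.
df.loc[mask, 'ORGINATOR_PREDICTION'] = 'Org'
```

Here `pattern` is the string 'INC|LLC|LP' and `mask` is True only for the "APPLE INC" row, so a single vectorized assignment replaces the nested Python loops.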

Full code:

import pandas as pd

df  = pd.DataFrame({'ORIGINATOR':  ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
        'ORGINATOR_PREDICTION': ['Person', 'Person','Person']})

COMMON_ORG_UNIGRAMS_LIST = ['INC','LLC','LP']

df.loc[df['ORIGINATOR'].str.contains('|'.join(COMMON_ORG_UNIGRAMS_LIST)),'ORGINATOR_PREDICTION'] = 'Org'

print(df)
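
One caveat worth noting: a plain alternation matches substrings, whereas the loop in the question compares whole words. The sketch below is a possible variant, not part of the original answer, that wraps the alternation in `\b` word boundaries and escapes each term with `re.escape`. The extra rows 'ALPHA RALPH' and 'ACME LP' are made-up data added only to show the difference.

```python
import re
import pandas as pd

df = pd.DataFrame({
    # The last two rows are hypothetical examples, not from the question.
    'ORIGINATOR': ['JOHN DOE', 'APPLE INC', 'ALPHA RALPH', 'ACME LP'],
    'ORGINATOR_PREDICTION': ['Person'] * 4,
})
COMMON_ORG_UNIGRAMS_LIST = ['INC', 'LLC', 'LP']

# Plain alternation also matches inside words: 'LP' is found in 'ALPHA'.
substring_mask = df['ORIGINATOR'].str.contains(
    '|'.join(COMMON_ORG_UNIGRAMS_LIST), na=False
)

# \b restricts matches to whole words; re.escape protects any regex
# metacharacters in the list; (?:...) is a non-capturing group.
whole_word = r'\b(?:' + '|'.join(map(re.escape, COMMON_ORG_UNIGRAMS_LIST)) + r')\b'
word_mask = df['ORIGINATOR'].str.contains(whole_word, na=False)

df.loc[word_mask, 'ORGINATOR_PREDICTION'] = 'Org'
```

With this data, `substring_mask` flags 'ALPHA RALPH' as an organization while `word_mask` does not. Note that `\b` is not identical to splitting on whitespace: it would also match 'INC' in a value such as 'APPLE INC.', which may or may not be what you want.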
