Replace only last occurrence of column value in DataFrame

Question

I have a DataFrame with a Company column.

Company
-------------------------------                                                           
Tundra Corporation Art Limited
Desert Networks Incorporated
Mount Yellowhive Security Corp
Carter, Rath and Mueller Limited (USD/AC)
Barrows corporation /PACIFIC
Corporation, Mounted Security

I have a dictionary of regexes (pattern; replacement) to normalize the company entities.

(^|\s)corporation(\s|$); Corp 
(^|\s)Limited(\s|$); LTD 
(^|\s)Incorporated(\s|$); INC 
...

I need to normalize only the last occurrence. This is my desired output.

Company
-------------------------------                                                           
Tundra Corporation Art LTD
Desert Networks INC
Mount Yellowhive Security Corp
Carter, Rath and Mueller LTD (USD/AC)
Barrows Corp /PACIFIC
Corp, Mounted Security

(For example, in "Tundra Corporation Art Limited" only Limited should be normalized, not Corporation.)

My code:

for k, v in entity_dict.items():
    df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k,re.I), value=v)

Is it possible to change only the last occurrence of an entity? Do I need to change my regex?

Answer 1

jez*_*ael 5

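Change (\s|$) to ($) so that the patterns match only at the end of the string. This covers names that end with the entity, such as the first three rows: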

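import re

# Raw strings keep \s from being treated as a string escape
entity_dict = {r'(^|\s)corporation($)': ' Corp',
               r'(^|\s)Limited($)': ' LTD',
               r'(^|\s)Incorporated($)': ' INC'}

for k, v in entity_dict.items():
    # Assign the result back instead of using inplace=True on a column selection
    df['Company'] = df['Company'].replace(to_replace=re.compile(k, re.I), value=v, regex=True)

print(df)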
                          Company
0      Tundra Corporation Art LTD
1             Desert Networks INC
2  Mount Yellowhive Security Corp

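The end-of-string anchor does not reach an entity that is followed by other text, as in "Carter, Rath and Mueller Limited (USD/AC)". One possible alternative, sketched here in plain Python, is a negative lookahead that accepts an entity only when no other entity follows it. The helper name normalize_last, the plain-word dictionary, and the \b word boundaries are assumptions made for illustration and are not part of the original answer.

```python
import re

# Plain entity names (taken from the question's patterns) and their normalized forms
entity_dict = {"corporation": "Corp", "Limited": "LTD", "Incorporated": "INC"}
lower = {k.lower(): v for k, v in entity_dict.items()}

# One alternation of all entity names: "corporation|limited|incorporated"
names = "|".join(re.escape(k) for k in lower)

# Match an entity as a whole word, but only if no other entity follows it
last_entity = re.compile(rf"\b({names})\b(?!.*\b(?:{names})\b)", flags=re.I)


def normalize_last(company):
    """Hypothetical helper: replace only the last entity in one company name."""
    return last_entity.sub(lambda m: lower[m.group(1).lower()], company)


companies = [
    "Tundra Corporation Art Limited",
    "Desert Networks Incorporated",
    "Mount Yellowhive Security Corp",
    "Carter, Rath and Mueller Limited (USD/AC)",
    "Barrows corporation /PACIFIC",
    "Corporation, Mounted Security",
]
normalized = [normalize_last(c) for c in companies]
```

With the DataFrame from the question, the same helper could be applied per row with df['Company'].map(normalize_last).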

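Edit: You can simplify the dictionary to plain strings without regexes, then create a dictionary with lowercase keys. Use Series.str.findall to find all matches, take the last one by indexing with str[-1], map it through the lowercase dictionary with Series.map, and finally do the replacement in a list comprehension: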

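import re

entity_dict = {'corporation': 'Corp',
               'Limited': 'LTD',
               'Incorporated': 'INC'}

# Lowercase keys so a match can be looked up regardless of its case
lower = {k.lower():v for k, v in entity_dict.items()}
# s1: last matched entity per row ('' if none); s2: its normalized form
s1 = df['Company'].str.findall('|'.join(lower.keys()), flags=re.I).str[-1].fillna('')
s2 = s1.str.lower().map(lower).fillna('')

# rsplit(b, 1) splits at the last occurrence only, so an identical earlier match stays untouched
df['Company'] = [c.join(a.rsplit(b, 1)) if b else a for a, b, c in zip(df['Company'], s1, s2)]
print(df)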
                                 Company
0             Tundra Corporation Art LTD
1                    Desert Networks INC
2         Mount Yellowhive Security Corp
3  Carter, Rath and Mueller LTD (USD/AC)
4                  Barrows Corp /PACIFIC
5                 Corp, Mounted Security

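The same idea (find every entity match, keep the last one, replace it) can also be sketched without pandas by splicing the replacement in at the position of the last match. Working with match positions means an identical earlier occurrence, as in the made-up name "Limited Art Limited", is left alone. The helper name replace_last_match is hypothetical, and the pattern has no word boundaries, mirroring the findall call above.

```python
import re

entity_dict = {"corporation": "Corp", "Limited": "LTD", "Incorporated": "INC"}
lower = {k.lower(): v for k, v in entity_dict.items()}

# Case-insensitive alternation of all entity names
pattern = re.compile("|".join(re.escape(k) for k in lower), flags=re.I)


def replace_last_match(name):
    """Hypothetical helper: swap the last entity match for its normalized form."""
    matches = list(pattern.finditer(name))
    if not matches:
        return name  # nothing to normalize
    last = matches[-1]
    replacement = lower[last.group(0).lower()]
    # Rebuild the string around the span of the last match only
    return name[:last.start()] + replacement + name[last.end():]


example = replace_last_match("Limited Art Limited")
```

Here example keeps the first "Limited" and normalizes only the second one.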

Archived:	6 years, 10 months ago
Views:	77
Last activity:	6 years, 10 months ago