Joh*_*Doe 3 python python-3.x pandas
I've a DataFrame with a Company column.
Company
-------------------------------
Tundra Corporation Art Limited
Desert Networks Incorporated
Mount Yellowhive Security Corp
Carter, Rath and Mueller Limited (USD/AC)
Barrows corporation /PACIFIC
Corporation, Mounted Security
Run Code Online (Sandbox Code Playgroud)
I've a dictionary with regexes to normalize the company entities.
(^|\s)corporation(\s|$); Corp
(^|\s)Limited(\s|$); LTD
(^|\s)Incorporated(\s|$); INC
...
Run Code Online (Sandbox Code Playgroud)
I need to normalize only the last occurrence. This is my desired output.
Company
-------------------------------
Tundra Corporation Art LTD
Desert Networks INC
Mount Yellowhive Security Corp
Carter, Rath and Mueller LTD (USD/AC)
Barrows Corp /PACIFIC
Corp, Mounted Security
Run Code Online (Sandbox Code Playgroud)
(Only normalize Limited and not Corporation for : Tundra Corporation Art Limited)
My code:
for k, v in entity_dict.items():
df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k,re.I), value=v)
Run Code Online (Sandbox Code Playgroud)
Is it possible to only change the last occurrence of an entity (do i need to change my regex)?
Change (\s|$) to ($) for match end of strings:
entity_dict = {'(^|\s)corporation($)': ' Corp',
'(^|\s)Limited($)': ' LTD',
'(^|\s)Incorporated($)': ' INC'}
for k, v in entity_dict.items():
df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k,re.I), value=v)
print (df)
Company
0 Tundra Corporation Art LTD
1 Desert Networks INC
2 Mount Yellowhive Security Corp
Run Code Online (Sandbox Code Playgroud)
编辑:您可以简化无正则表达式的字典,然后创建小写字典以供可能使用Series.str.findall,获取索引的最后一个值,str[-1]并Series.map通过小写字典,最后在列表中替换:
entity_dict = {'corporation': 'Corp',
'Limited': 'LTD',
'Incorporated': 'INC'}
lower = {k.lower():v for k, v in entity_dict.items()}
s1 = df['Company'].str.findall('|'.join(lower.keys()), flags=re.I).str[-1].fillna('')
s2 = s1.str.lower().map(lower).fillna('')
df['Company'] = [a.replace(b, c) for a, b, c in zip(df['Company'], s1, s2)]
print (df)
Company
0 Tundra Corporation Art LTD
1 Desert Networks INC
2 Mount Yellowhive Security Corp
3 Carter, Rath and Mueller LTD (USD/AC)
4 Barrows Corp /PACIFIC
5 Corp, Mounted Security
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
77 次 |
| 最近记录: |