Alv*_*rez 2 python pandas regex-lookarounds
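I'm looking for a way to extract the country name without any other names or numbers attached.
My goal is to extract the substring that is not in parentheses, without trailing spaces or digits, into a new column.
For example: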
String                              New string
Bolivia (Plurinational State of)    Bolivia
United States of America20          United States of America
The data looks like this:
**Country**                                   **Energy Supply**
Antigua and Barbuda                           8000000
Bolivia (Plurinational State of)              50000
Iran (Islamic Republic of)                    20000
Sint Maarten (Dutch part)                     58000
United States of America20                    65000
China, Macao Special AdministrativeRegion4    52000
.....more cases....                           ....more cases....
My code looks like this:
df['newcontry']=df['Country'].str.extract(r'(\w*\s)')
and it returns something like this:
**Country**                                   **Energy Supply**    newcontry
Antigua and Barbuda                           8000000              Antigua
Bolivia (Plurinational State of)              50000                Bolivia
Iran (Islamic Republic of)                    20000                Iran
Sint Maarten (Dutch part)                     58000                Sint
United States of America20                    65000                United
China, Macao Special AdministrativeRegion4    52000                China
What can I change to fix this?
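Assuming you only need the leading chunk of each string, you can combine a lazy capture group (.+?) with an alternation that stops at a digit (\d), an opening parenthesis (\(), or the end of the string ($): r"^(.+?) ?(?:\d|\(|$)". The optional space before the alternation keeps a trailing space out of the captured chunk.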
>>> import pandas as pd
>>> df = pd.DataFrame({"Country": ["Bolivia (Plurinational State of)", "United States of America20", "Antigua and Barbuda"]})
>>> df
                            Country
0  Bolivia (Plurinational State of)
1        United States of America20
2               Antigua and Barbuda
>>> df["Country"].str.extract(r"^(.+?) ?(?:\d|\(|$)")
                          0
0                   Bolivia
1  United States of America
2       Antigua and Barbuda
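To see how the pattern behaves on every sample row from the question, here is a minimal sketch that uses only the standard `re` module. The helper name `leading_name` and the `countries` list are illustrative assumptions, not part of the original post; `str.extract` in pandas applies the same regular expression to each value.

```python
import re

# Same pattern as in the answer: a lazy capture, an optional space,
# then a digit, an opening parenthesis, or the end of the string.
PATTERN = re.compile(r"^(.+?) ?(?:\d|\(|$)")


def leading_name(text):
    """Return the leading chunk of `text`, or None if the pattern does not match.

    Hypothetical helper for illustration only.
    """
    match = PATTERN.search(text)
    return match.group(1) if match else None


# Sample rows copied from the question's data.
countries = [
    "Antigua and Barbuda",
    "Bolivia (Plurinational State of)",
    "Iran (Islamic Republic of)",
    "Sint Maarten (Dutch part)",
    "United States of America20",
    "China, Macao Special AdministrativeRegion4",
]

for country in countries:
    print(repr(leading_name(country)))
# 'Antigua and Barbuda'
# 'Bolivia'
# 'Iran'
# 'Sint Maarten'
# 'United States of America'
# 'China, Macao Special AdministrativeRegion'
```

Note that the last row keeps its comma and the rest of the name, because the pattern only stops at a digit, an opening parenthesis, or the end of the string. If only "China" is wanted there, the alternation would need an extra branch, which goes beyond what the answer proposes.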