A more efficient way than loc to clean a DataFrame

geo*_*asm 2 python dataframe pandas

My code looks like this:

import pandas as pd
df = pd.read_excel("Energy Indicators.xls", header=None)
c_df = df.copy()
c_df = c_df.iloc[18:245, 2:]
c_df = c_df.rename(columns={2: 'Country', 3: 'Energy Supply', 4:'Energy Supply per Capita', 5:'% Renewable'})
c_df['Energy Supply'] = c_df['Energy Supply'].apply(lambda x: x*1000000)
c_df.loc[c_df['Country'] == 'Korea, Rep.', 'Country'] = 'South Korea'
c_df.loc[c_df['Country'] == 'United States of America20', 'Country'] = 'United States'
c_df.loc[c_df['Country'] == 'United Kingdom of Great Britain and Northern Ireland', 'Country'] = 'United Kingdom'
c_df.loc[c_df['Country'] == 'China, Hong Kong Special Administrative Region', 'Country'] = 'Hong Kong'
c_df.loc[c_df['Country'] == 'Venezuela (Bolivarian Republic of)', 'Country'] = 'Venezuela'
c_df.loc[c_df['Country'] == 'Bolivia (Plurinational State of)', 'Country'] = 'Bolivia'
c_df.loc[c_df['Country'] == 'Switzerland17', 'Country'] = 'Switzerland'
c_df.loc[c_df['Country'] == 'Australia1', 'Country'] = 'Australia'
c_df.loc[c_df['Country'] == 'China2', 'Country'] = 'China'
c_df.loc[c_df['Country'] == 'Falkland Islands (Malvinas)', 'Country'] = 'Bolivia'
c_df.loc[c_df['Country'] == 'Greenland7', 'Country'] = 'Greenland'
c_df.loc[c_df['Country'] == 'Iran (Islamic Republic of', 'Country'] = 'Iran'
c_df.loc[c_df['Country'] == 'Italy9', 'Country'] = 'Italy'
c_df.loc[c_df['Country'] == 'Japan10', 'Country'] = 'Japan'
c_df.loc[c_df['Country'] == 'Kuwait11', 'Country'] = 'Kuwait'
c_df.loc[c_df['Country'] == 'Micronesia (Federal States of)', 'Country'] = 'Micronesia'
c_df.loc[c_df['Country'] == 'Netherlands12', 'Country'] = 'Netherlands'
c_df.loc[c_df['Country'] == 'Portugal13', 'Country'] = 'Portugal'
c_df.loc[c_df['Country'] == 'Saudi Arabia14', 'Country'] = 'Saudi Arabia'
c_df.loc[c_df['Country'] == 'Serbia15', 'Country'] = 'Serbia'
c_df.loc[c_df['Country'] == 'Sint Maarteen (Dutch part)', 'Country'] = 'Sint Marteen'
c_df.loc[c_df['Country'] == 'Spain16', 'Country'] = 'Spain'
c_df.loc[c_df['Country'] == 'Ukraine18', 'Country'] = 'Ukraine'
c_df.loc[c_df['Country'] == 'Denmark5', 'Country'] = 'Denmark'
c_df.loc[c_df['Country'] == 'France6', 'Country'] = 'France'
c_df.loc[c_df['Country'] == 'Indonesia8', 'Country'] = 'Indonesia'

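I feel there must be a simpler way to change the values for countries whose names contain parentheses and digits. Which pandas method could I use to find the names in a column that contain parentheses or digits? isin?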

cs9*_*s95 5

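You can start by stripping out the digits and any text in parentheses. After that, for everything else that needs a non-trivial replacement, declare a mapping and apply it with pd.Series.replace.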

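mapper = {
    'Korea, Rep.': 'South Korea',
    'Falkland Islands': 'Bolivia',
    # ... remaining non-trivial replacements
}

# regex=True is required in recent pandas versions, where str.replace
# treats the pattern as a literal string by default.
df['Country'] = (
    df['Country'].str.replace(r'\d+|\s*\(.*\)', '', regex=True).str.strip().replace(mapper)
)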

Simple, and done.
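As a rough, self-contained sketch of the approach, here it is applied to a handful of names copied from the question. The sample Series `countries` and the mapping entries are chosen only for illustration; a real mapping would cover every name that the regex alone cannot fix.

```python
import pandas as pd

# Illustrative sample only: a few raw names taken from the question.
countries = pd.Series([
    'Korea, Rep.',
    'United States of America20',
    'Bolivia (Plurinational State of)',
    'Switzerland17',
    'China, Hong Kong Special Administrative Region',
    'France6',
])

# Replacements that stripping digits and parentheses cannot produce.
# Keys must match the names as they look AFTER the regex step.
mapper = {
    'Korea, Rep.': 'South Korea',
    'United States of America': 'United States',
    'China, Hong Kong Special Administrative Region': 'Hong Kong',
}

cleaned = (
    countries
    .str.replace(r'\d+|\s*\(.*\)', '', regex=True)  # drop digits and "(...)"
    .str.strip()                                     # trim leftover whitespace
    .replace(mapper)                                 # exact-match renames
)

print(cleaned.tolist())
# ['South Korea', 'United States', 'Bolivia', 'Switzerland', 'Hong Kong', 'France']
```

Because Series.replace with a dict matches whole values exactly, each key in the mapping has to be written as the name appears after the digits and parenthesized text have been removed.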

Details

\d+     # one or more digits
|       # regex alternation (OR)
\s*     # zero or more whitespace characters
\(      # literal opening parenthesis
.*      # any characters (greedy)
\)      # literal closing parenthesis
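To answer the "which method finds these names" part of the question directly, one option is pd.Series.str.contains with a similar pattern, which lets you inspect the affected rows before changing anything. This is a minimal sketch; the sample names (including the already clean 'Spain') are chosen only for illustration.

```python
import pandas as pd

# Illustrative sample only; 'Spain' stands in for a name that needs no cleaning.
countries = pd.Series(['Italy9', 'Iran (Islamic Republic of)', 'Denmark5', 'Spain'])

# True where the name contains a digit or an opening parenthesis.
needs_cleaning = countries.str.contains(r'\d|\(', regex=True)

print(countries[needs_cleaning].tolist())
# ['Italy9', 'Iran (Islamic Republic of)', 'Denmark5']
```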