How do I load an Excel sheet and clean the data in Python?

Question

pra*_*het 0 python dataframe pandas data-science

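Load the energy data from the file Energy Indicators.xls, which is a list of indicators of energy supply and renewable electricity production from the United Nations for the year 2013, and put it into a DataFrame with the variable name energy.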

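Keep in mind that this is an Excel file, not a comma-separated values file. Also, make sure to exclude the footer and header information from the data file. The first two columns are unnecessary, so you should get rid of them, and you should change the column labels so that the columns are:

['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']

Convert Energy Supply to gigajoules (there are 1,000,000 gigajoules in a petajoule). For all countries which have missing data (e.g. data with "..."), make sure this is reflected as np.NaN values.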
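The missing-value and unit steps can be sketched on a small in-memory frame. This is only an illustration: the country names and numbers below are made up, and `value_cols` is a helper name introduced here. When reading the real file, `pd.read_excel` also accepts `skiprows`, `skipfooter` and `na_values`, which may cover the header, footer and "..." handling in one call; the right row counts depend on the sheet and are not shown here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the parsed sheet; all names and values are made up.
raw = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascadia"],
    "Energy Supply": [321, "...", 102],            # petajoules, "..." = missing
    "Energy Supply per Capita": [10, "...", 90],
    "% Renewable": [78.6, 0.0, "..."],
})

value_cols = ["Energy Supply", "Energy Supply per Capita", "% Renewable"]

energy = raw.copy()
# errors="coerce" turns every non-numeric cell, including "...", into NaN.
energy[value_cols] = energy[value_cols].apply(pd.to_numeric, errors="coerce")
# 1 petajoule = 1,000,000 gigajoules.
energy["Energy Supply"] = energy["Energy Supply"] * 1_000_000

print(energy)
```

Converting to numeric before multiplying avoids multiplying a column that still contains strings.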

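Rename the following list of countries (for use in later questions): "Republic of Korea": "South Korea", "United States of America": "United States", "United Kingdom of Great Britain and Northern Ireland": "United Kingdom", "China, Hong Kong Special Administrative Region": "Hong Kong"

There are also several countries with numbers and/or parentheses in their names. Be sure to remove these, e.g. 'Bolivia (Plurinational State of)' should be 'Bolivia' and 'Switzerland17' should be 'Switzerland'.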
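A minimal sketch of the name clean-up, using the two examples from the task plus one name that needs no change. The patterns are one possible choice, not the only one.

```python
import pandas as pd

names = pd.Series([
    "Bolivia (Plurinational State of)",
    "Switzerland17",
    "Japan",
])

cleaned = (
    names
    .str.replace(r"\s*\(.*\)", "", regex=True)  # drop a parenthesised suffix and the space before it
    .str.replace(r"\d+$", "", regex=True)       # drop trailing footnote digits
    .str.strip()                                # remove any leftover whitespace
)

print(cleaned.tolist())  # ['Bolivia', 'Switzerland', 'Japan']
```

Passing `regex=True` explicitly matters on pandas 2.0 and later, where `Series.str.replace` treats the pattern as a literal string by default. The `\s*` in the first pattern is also worth noting: a pattern such as `\(.*\)` alone leaves 'Bolivia ' with a trailing space, and that name no longer matches 'Bolivia' in a merge.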

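Next, load the GDP data from the file world_bank.csv, which is a csv containing countries' GDP from 1960 to 2015 from the World Bank. Call this DataFrame GDP. Make sure to skip the header, and rename the following list of countries: "Korea, Rep.": "South Korea", "Iran, Islamic Rep.": "Iran", "Hong Kong SAR, China": "Hong Kong"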
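Skipping the metadata lines and renaming can be sketched with a made-up miniature file. The layout below (two metadata lines, then the header row) and every value in it are assumptions for illustration; the real world_bank.csv has more metadata lines and more columns, so the `skiprows` value has to be adjusted to it.

```python
import io
import pandas as pd

# Made-up miniature file: metadata lines first, then the real header row.
sample = io.StringIO(
    "Data Source,World Development Indicators\n"
    "Last Updated Date,(not shown)\n"
    "Country Name,Country Code,2014,2015\n"
    '"Korea, Rep.",KOR,1.2,1.3\n'
    '"Iran, Islamic Rep.",IRN,0.4,0.4\n'
    '"Hong Kong SAR, China",HKG,0.2,0.3\n'
)

# skiprows=2 drops the two metadata lines so the third line becomes the header.
gdp = pd.read_csv(sample, skiprows=2)
gdp = gdp.rename(columns={"Country Name": "Country"})

# Restrict the replacement to the Country column and pass all renames at once.
gdp["Country"] = gdp["Country"].replace({
    "Korea, Rep.": "South Korea",
    "Iran, Islamic Rep.": "Iran",
    "Hong Kong SAR, China": "Hong Kong",
})

print(gdp)
```

Read this way, the year labels arrive as strings such as '2014', so no later column renaming is needed.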

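Finally, load the SCImago Journal and Country Rank data for Energy Engineering and Power Technology from the file scimagojr-3.xlsx, which ranks countries based on their journal contributions in that area. Call this DataFrame ScimEn.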

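Join the three datasets GDP, Energy, and ScimEn into a new dataset (using the intersection of country names). Use only the last 10 years (2006-2015) of GDP data and only the top 15 countries by Scimagojr 'Rank' (Rank 1 through 15).

The index of this DataFrame should be the name of the country, and the columns should be ['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015'].

This function should return a DataFrame with 20 columns and 15 entries.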
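The join described above can be sketched as follows. The function name `combine`, its parameters and the toy frames are hypothetical; the toy data uses a cutoff of 2 and two years so the result is easy to check by eye.

```python
import pandas as pd

def combine(energy, gdp, scimen, top_n=15, years=range(2006, 2016)):
    """Inner-join the three frames on Country, keeping the top_n ranks."""
    year_cols = [str(y) for y in years]
    top = scimen[scimen["Rank"] <= top_n]
    merged = (
        top.merge(energy, on="Country", how="inner")
           .merge(gdp[["Country"] + year_cols], on="Country", how="inner")
    )
    return merged.sort_values("Rank").set_index("Country")

# Toy frames with made-up values.
scimen = pd.DataFrame({"Country": ["A", "B", "C"], "Rank": [1, 2, 3], "Documents": [30, 20, 10]})
energy = pd.DataFrame({"Country": ["B", "A", "C"], "Energy Supply": [2.0, 1.0, 3.0]})
gdp = pd.DataFrame({"Country": ["A", "B", "C"], "2013": [1, 2, 3], "2014": [4, 5, 6], "2015": [7, 8, 9]})

result = combine(energy, gdp, scimen, top_n=2, years=range(2014, 2016))
print(result)
```

Filtering on Rank before merging means a top-15 country that fails to match shows up as a missing row rather than being silently replaced by the next-ranked country.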

I tried the following code for this problem, but it returns only 12 rows instead of 15:

import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Energy = pd.read_excel('Energy Indicators.xls')
Energy.drop(Energy.columns[[0,1]],axis=1,inplace=True)
Energy.columns=['Country','Energy Supply','Energy Supply per capita','% Renewable']
Energy['Energy Supply']*=1000000
Energy['Country'] = Energy['Country'].str.replace(r"\(.*\)","")
Energy['Country'] = Energy['Country'].str.replace("[0-9()]+$", "")
Energy.replace('Republic of Korea','South Korea', inplace = True)
Energy.replace('United States of America','United States', inplace = True)
Energy.replace('United Kingdom of Great Britain and Northern Ireland','United Kingdom', inplace = True)
Energy.replace('China, Hong Kong Special Administrative Region','Hong Kong', inplace = True)

import pandas as pd

GDP = pd.read_csv('world_bank.csv', index_col=0, header=None)
GDP = GDP.drop(['Data Source'])
GDP = GDP.dropna()
GDP = GDP.reset_index()
GDP.columns = GDP.iloc[0]
GDP.drop(GDP.index[[0,3]], inplace=True)
GDP = GDP.rename(columns={'Country Name': 'Country'})
GDP.replace(',','-', inplace=True)
GDP = GDP.replace('Korea, Rep.','South Korea')
GDP = GDP.replace('Iran, Islamic Rep.','Iran')
GDP = GDP.replace('Hong Kong SAR, China','Hong Kong')


import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

ScimEn = pd.read_excel('scimagojr-3.xlsx')


b = pd.merge(pd.merge(Energy,GDP,on='Country'),ScimEn,on='Country')
a = pd.merge(pd.merge(Energy,GDP,on='Country'),ScimEn,on='Country')
a = a.sort(['Rank'], ascending=[True])
a = a[a["Rank"] < 16]
a=a.rename(columns = {'2006.0':'abc'})

a.columns.values[53] = "2006"
a.columns.values[54] = "2007"
a.columns.values[55] = "2008"
a.columns.values[56] = "2009"
a.columns.values[57] = "2010"
a.columns.values[58] = "2011"
a.columns.values[59] = "2012"
a.columns.values[60] = "2013"
a.columns.values[61] = "2014"
a.columns.values[62] = "2015"

a = a[['Country','Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']]

a = a.set_index('Country')

def ans():
    return a

ans()


Answer 1

小智 5

import numpy as np 
import pandas as pd 


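def energy():
    # Parse the 'Energy' sheet, then keep only the data rows and the four useful columns.
    energy = pd.ExcelFile('Energy Indicators.xls').parse('Energy')
    energy = energy.iloc[16:243][['Environmental Indicators: Energy', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5']].copy()
    energy.columns = ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']

    # '...' marks missing data; then convert petajoules to gigajoules.
    energy = energy.replace('...', np.nan)
    energy['Energy Supply'] = energy['Energy Supply'] * 1000000

    # Strip trailing footnote digits first so the exact-match renames below still apply.
    energy['Country'] = energy['Country'].str.replace(r'\d+$', '', regex=True)

    # Rename before the extract below, which would cut 'China, Hong Kong ...' down to 'China'.
    energy = energy.replace("Republic of Korea", "South Korea")
    energy = energy.replace("United States of America", "United States")
    energy = energy.replace("United Kingdom of Great Britain and Northern Ireland", "United Kingdom")
    energy = energy.replace("China, Hong Kong Special Administrative Region", "Hong Kong")

    # Keep only the leading letters and spaces, which drops parenthesised suffixes.
    energy['Country'] = energy['Country'].str.extract(r'(^[a-zA-Z\s]+)', expand=False).str.strip()

    # Return the whole frame (the original returned energy.iloc[43], a single row).
    return energy.reset_index(drop=True)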

def GDP():
    GDP = pd.read_csv('world_bank.csv')
    # Row 3 holds the real header: four text labels, then the years,
    # which are converted here to strings such as '2006'.
    header = GDP.iloc[3].values
    s = header[:4].astype(str).tolist() + header[4:].astype(int).astype(str).tolist()
    GDP = GDP.iloc[4:].copy()
    GDP.columns = s
    # Keep the country name and the last ten years only.
    GDP = GDP[['Country Name', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']]
    GDP = GDP.rename(columns={'Country Name': 'Country'})
    GDP = GDP.replace("Korea, Rep.", "South Korea", regex=False)
    GDP = GDP.replace("Iran, Islamic Rep.", "Iran")
    GDP = GDP.replace("Hong Kong SAR, China", "Hong Kong", regex=False)
    return GDP

def ScimEn():
    ScimEn=pd.ExcelFile('scimagojr-3.xlsx').parse('Sheet1')

    return ScimEn

def result():
    e = energy()
    G = GDP()
    S = ScimEn()
    # Inner joins keep only the countries present in all three tables.
    tdf = pd.merge(e, G, on='Country')
    tdf = pd.merge(tdf, S, on='Country')
    # sort_values returns a new frame; take the 15 best-ranked rows.
    res = tdf.sort_values(by='Rank').head(15)
    res = res.set_index('Country')
    return res
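When an inner merge returns fewer rows than expected, one way to find the countries that dropped out is a left merge with `indicator=True`. The sketch below is not part of the original answer: the helper name `unmatched` and the toy frames are hypothetical, and the trailing space is just one example of a mismatch.

```python
import pandas as pd

def unmatched(left, right, key="Country"):
    """Return the key values of `left` that have no partner in `right`."""
    probe = left[[key]].merge(
        right[[key]].drop_duplicates(), on=key, how="left", indicator=True
    )
    return probe.loc[probe["_merge"] == "left_only", key].tolist()

# Toy frames with made-up ranks; note the trailing space in 'Iran '.
scimen = pd.DataFrame({"Country": ["Iran", "Japan", "Germany"], "Rank": [13, 3, 7]})
energy = pd.DataFrame({"Country": ["Iran ", "Japan", "Germany"]})

print(unmatched(scimen, energy))  # ['Iran']
```

Running such a check against both the energy and GDP frames shows which names still differ. In the question's code, `GDP.dropna()` is also worth checking: it removes every country that has a missing value in any year from 1960 to 2015, which may account for some of the missing rows.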


Archived:	9 years, 4 months ago
Views:	11,325
Last activity:	5 years, 9 months ago