Mar*_*dez 2 python interpolation dataframe panel-data
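I have a dataset that looks like this:
I imported it into a pandas DataFrame with pandas.read_csv, using the Year and Country columns as the index. What I need to do is change the time step from every 5 years to every year and interpolate the values, and I really don't know how to do that. I am learning both R and Python, so help in either language would be highly appreciated.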
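If you give the DataFrame a DatetimeIndex, you can take advantage of the df.resample and df.interpolate('time') methods.
To make df.index a DatetimeIndex, you might be tempted to use set_index('Year'). However, Year by itself is not unique, since it is repeated for each Country. In order to call resample we need a unique index. So use df.pivot instead: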
# convert integer years into `datetime64` values
In [441]: df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')
In [442]: df.pivot(index='Year', columns='Country')
Out[442]:
                Avg1                      Avg2
Country    Australia Austria Belgium Australia Austria Belgium
Year
1950-01-01         0       0       0         0       0       0
1955-01-01         1       1       1        10      10      10
1960-01-01         2       2       2        20      20      20
1965-01-01         3       3       3        30      30      30
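As a small illustration of the uniqueness point above, the following sketch compares the index produced by set_index('Year') with the one produced by pivot. The toy data is hypothetical and only mirrors the shape of the example. It converts the years with pd.to_datetime instead of Series.view, since Series.view is deprecated in recent pandas releases; that substitution is an assumption of this sketch, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data shaped like the example in this answer.
countries = ['Australia', 'Austria', 'Belgium']
years = np.arange(1950, 1970, 5)
df = pd.DataFrame({
    'Country': np.repeat(countries, len(years)),
    'Year': np.tile(years, len(countries)),
    'Avg1': np.tile(np.arange(len(years)), len(countries)),
})

# Turn integer years into timestamps (January 1 of each year).
df['Year'] = pd.to_datetime(df['Year'].astype(str), format='%Y')

# set_index keeps one row per (Year, Country), so each year repeats.
by_year = df.set_index('Year')
print(by_year.index.is_unique)   # False

# pivot spreads the countries across columns, leaving one row per year.
wide = df.pivot(index='Year', columns='Country', values='Avg1')
print(wide.index.is_unique)      # True
print(isinstance(wide.index, pd.DatetimeIndex))  # True
```

Passing values='Avg1' keeps the columns single-level here; leaving it out, as the answer does, pivots every remaining column and gives the two-level column header shown above.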
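Then you can use df.resample('A').mean() to resample the data with annual frequency. You can think of resample('A') as chopping df up into groups of 1-year intervals. resample returns a DatetimeIndexResampler object whose mean method aggregates the values in each group by taking the mean. Thus mean() returns a DataFrame with one row for every year. Since your original df has one datum every 5 years, most of the 1-year groups are empty, so the mean returns NaN for those years. If your data is consistently spaced at 5-year intervals, then instead of .mean() you could use .first() or .last(). They would all return the same result.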
In [438]: df.resample('A').mean()
Out[438]:
                Avg1                      Avg2
Country    Australia Austria Belgium Australia Austria Belgium
Year
1950-12-31       0.0     0.0     0.0       0.0     0.0     0.0
1951-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1952-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1953-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1954-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1955-12-31       1.0     1.0     1.0      10.0    10.0    10.0
1956-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1957-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1958-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1959-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1960-12-31       2.0     2.0     2.0      20.0    20.0    20.0
1961-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1962-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1963-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1964-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1965-12-31       3.0     3.0     3.0      30.0    30.0    30.0
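The claim that .mean(), .first() and .last() agree for data spaced at 5-year intervals can be checked with a short sketch. The frame below is hypothetical. The sketch passes a pd.offsets.YearEnd() object rather than the 'A' string, on the assumption that the offset object behaves the same way and avoids the alias rename in newer pandas versions ('A' became 'YE' in pandas 2.2).

```python
import pandas as pd

# Hypothetical wide frame: one row every 5 years, one column per country.
idx = pd.to_datetime(['1950-01-01', '1955-01-01', '1960-01-01'])
wide = pd.DataFrame({'Australia': [0.0, 1.0, 2.0],
                     'Austria': [0.0, 10.0, 20.0]}, index=idx)

# Group the rows into calendar years (bins labelled by December 31).
yearly = wide.resample(pd.offsets.YearEnd())

by_mean = yearly.mean()
by_first = yearly.first()
by_last = yearly.last()

# Each non-empty group holds exactly one row, so all three aggregations
# return that row; empty groups come back as NaN in every case.
pd.testing.assert_frame_equal(by_mean, by_first, check_dtype=False)
pd.testing.assert_frame_equal(by_mean, by_last, check_dtype=False)

print(len(by_mean))                           # 11 yearly rows, 1950 to 1960
print(by_mean.isna().all(axis=1).sum())       # 8 of them are all-NaN
```

If a year could ever contain more than one observation, the three methods would diverge, so mean() is the safer default when the spacing is not guaranteed.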
Then df.interpolate(method='time') will linearly interpolate the missing NaN values based on the nearest non-NaN values and their associated datetime index values.
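import numpy as np
import pandas as pd

# build a small example: 3 countries, one observation every 5 years
countries = 'Australia Austria Belgium'.split()
year = np.arange(1950, 1970, 5)
df = pd.DataFrame(
    {'Country': np.repeat(countries, len(year)),
     'Year': np.tile(year, len(countries)),
     'Avg1': np.tile(np.arange(len(year)), len(countries)),
     'Avg2': 10*np.tile(np.arange(len(year)), len(countries))})

# convert integer years into `datetime64` values
# (Series.view is deprecated in recent pandas;
#  pd.to_datetime(df['Year'].astype(str), format='%Y') is an alternative)
df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')
# one row per year, one column per (variable, country) pair
df = df.pivot(index='Year', columns='Country')
# resample to annual frequency; years without data become NaN
# (the 'A' alias is spelled 'YE' in pandas 2.2 and later)
df = df.resample('A').mean()
# fill the NaNs by linear interpolation weighted by elapsed time
df = df.interpolate(method='time')
# back to long format: one row per (Year, Country)
df = df.stack('Country')
df = df.reset_index()
df = df.sort_values(by=['Country', 'Year'])
print(df)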
yields
Year Country Avg1 Avg2
0 1950-12-31 Australia 0.000000 0.000000
3 1951-12-31 Australia 0.199890 1.998905
6 1952-12-31 Australia 0.400329 4.003286
9 1953-12-31 Australia 0.600219 6.002191
12 1954-12-31 Australia 0.800110 8.001095
15 1955-12-31 Australia 1.000000 10.000000
18 1956-12-31 Australia 1.200328 12.003284
21 1957-12-31 Australia 1.400109 14.001095
...
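The interpolated values in the output are close to, but not exactly, multiples of 0.2 (for example 0.199890 for 1951). That follows from method='time', which weights by the number of days between index values, and 1952 is a leap year. The sketch below reproduces this on a hypothetical single series and contrasts it with method='linear', which ignores the index and treats the rows as equally spaced. Which of the two is appropriate for annual panel data is a judgment call that the original answer does not address.

```python
import numpy as np
import pandas as pd

# Hypothetical series: known values at the ends, four missing years between.
idx = pd.to_datetime(['1950-12-31', '1951-12-31', '1952-12-31',
                      '1953-12-31', '1954-12-31', '1955-12-31'])
s = pd.Series([0.0, np.nan, np.nan, np.nan, np.nan, 1.0], index=idx)

# Weighted by elapsed days: 365 of the 1826 days have passed by 1951-12-31.
by_time = s.interpolate(method='time')

# Weighted by row position only: every step is exactly one fifth.
by_position = s.interpolate(method='linear')

print(by_time.round(6).tolist())
print(by_position.round(6).tolist())
```

With method='linear' the yearly steps come out as 0.2, 0.4, 0.6 and 0.8, which may be the more natural reading when each row stands for a whole year rather than a specific date.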