How to interpolate between values in a data frame using Python and/or R

Mar*_*dez 2 python interpolation dataframe panel-data

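I have a dataset that looks like this:

[image of the dataset]

I imported it into a pandas DataFrame with pandas.read_csv, using the "Year" and "Country" columns as the index. What I need to do is change the time step from every 5 years to every year and interpolate the values, and I really have no idea how to do that. I am learning both R and Python, so help in either language would be highly appreciated.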

unu*_*tbu 6

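  • If you give your DataFrame a DatetimeIndex, then you can take advantage of the df.resample and df.interpolate('time') methods.

  • To make df.index a DatetimeIndex, you might be tempted to use set_index('Year'). However, Year by itself is not unique, since it is repeated for each Country. To call resample we need a unique index, so use df.pivot instead: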

    # convert integer years into `datetime64` values
    In [441]: df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')
    In [442]: df.pivot(index='Year', columns='Country')
    Out[442]: 
                    Avg1                      Avg2                
    Country    Australia Austria Belgium Australia Austria Belgium
    Year                                                          
    1950-01-01         0       0       0         0       0       0
    1955-01-01         1       1       1        10      10      10
    1960-01-01         2       2       2        20      20      20
    1965-01-01         3       3       3        30      30      30
    
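  • Then you can use df.resample('A').mean() to resample the data at yearly frequency. You can think of resample('A') as chopping df into groups of 1-year intervals. resample returns a DatetimeIndexResampler object whose mean method aggregates the values in each group by taking the mean. Thus mean() returns a DataFrame with one row for every year. Since your original df has one datum every 5 years, most of the 1-year groups are empty, so the mean returns NaN for those years. If your data is consistently spaced at 5-year intervals, then instead of .mean() you could use .first() or .last(). They would all return the same result.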

    In [438]: df.resample('A').mean()
    Out[438]: 
                    Avg1                      Avg2                
    Country    Australia Austria Belgium Australia Austria Belgium
    Year                                                          
    1950-12-31       0.0     0.0     0.0       0.0     0.0     0.0
    1951-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1952-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1953-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1954-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1955-12-31       1.0     1.0     1.0      10.0    10.0    10.0
    1956-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1957-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1958-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1959-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1960-12-31       2.0     2.0     2.0      20.0    20.0    20.0
    1961-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1962-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1963-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1964-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1965-12-31       3.0     3.0     3.0      30.0    30.0    30.0
    
  • Then df.interpolate(method='time') will linearly interpolate the missing NaN values based on the nearest non-NaN values and their associated datetime index values.


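import numpy as np
import pandas as pd

# Build a small example: three countries observed every 5 years.
countries = 'Australia Austria Belgium'.split()
year = np.arange(1950, 1970, 5)
df = pd.DataFrame(
    {'Country': np.repeat(countries, len(year)),
     'Year': np.tile(year, len(countries)),
     'Avg1': np.tile(np.arange(len(year)), len(countries)),
     'Avg2': 10*np.tile(np.arange(len(year)), len(countries))})
# Convert integer years into datetime64 values (years since 1970).
df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')
# One row per year, one column per (variable, country) pair.
df = df.pivot(index='Year', columns='Country')

# Yearly bins; empty bins become NaN. Recent pandas releases spell
# the year-end alias 'YE' instead of 'A'.
df = df.resample('A').mean()
# Fill the NaNs linearly, weighted by the datetime index.
df = df.interpolate(method='time')

# Back to long format: one row per (Year, Country).
df = df.stack('Country')
df = df.reset_index()
df = df.sort_values(by=['Country', 'Year'])
print(df)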

yields

         Year    Country      Avg1       Avg2
0  1950-12-31  Australia  0.000000   0.000000
3  1951-12-31  Australia  0.199890   1.998905
6  1952-12-31  Australia  0.400329   4.003286
9  1953-12-31  Australia  0.600219   6.002191
12 1954-12-31  Australia  0.800110   8.001095
15 1955-12-31  Australia  1.000000  10.000000
18 1956-12-31  Australia  1.200328  12.003284
21 1957-12-31  Australia  1.400109  14.001095
...
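If evenly spaced steps (0.2, 0.4, ...) are wanted rather than the day-weighted values above, one alternative is to skip the datetime conversion and keep `Year` as an integer. The following is a sketch of that different technique, not the approach used in this answer; it rebuilds the same toy data, and the names `wide`, `all_years` and `long_df` are illustrative.

```python
import numpy as np
import pandas as pd

countries = ['Australia', 'Austria', 'Belgium']
years = np.arange(1950, 1970, 5)
df = pd.DataFrame({
    'Country': np.repeat(countries, len(years)),
    'Year': np.tile(years, len(countries)),
    'Avg1': np.tile(np.arange(len(years)), len(countries)),
    'Avg2': 10 * np.tile(np.arange(len(years)), len(countries)),
})

# One row per (integer) year, one column per (variable, country) pair.
wide = df.pivot(index='Year', columns='Country')

# Insert the missing years as all-NaN rows.
all_years = np.arange(wide.index.min(), wide.index.max() + 1)
wide = wide.reindex(all_years).rename_axis('Year')

# Default linear interpolation treats the rows as equally spaced.
wide = wide.interpolate()

# Back to long format, one row per (Year, Country).
long_df = (wide.stack('Country')
               .reset_index()
               .sort_values(by=['Country', 'Year']))
print(long_df.head(6))
```

Which variant is appropriate depends on whether the measurements are meant as points in calendar time or simply as one value per year.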

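  • @michael_j_ward: My understanding of `datetime64` comes mostly from http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html plus a lot of fooling around. The docs mention (and the dtype name `datetime64` strongly hints) that the underlying data type is an 8-byte int. So, to do numerical math on datetime64s, it is sometimes necessary to use `astype('i8')` to convert the `datetime64` to its underlying integer value. The `Code` column [shown here](http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units) lists the possible `datetime64[...]` dtypes.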
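The integer representation described in this comment can be checked directly in NumPy. This is a small illustrative sketch (the variable names are made up here): with the `Y` unit, the stored integer is the number of years since 1970, which is why the answer subtracts 1970 before reinterpreting the values.

```python
import numpy as np

years = np.array([1950, 1955, 1960], dtype='i8')

# Reinterpret "years since 1970" as datetime64 values with year resolution.
as_dates = (years - 1970).view('datetime64[Y]')
print(as_dates)

# Going the other way: the underlying 8-byte integer is the year offset.
offsets = as_dates.astype('i8')
print(offsets)

# Recover the original calendar years.
back = offsets + 1970
print(back)
```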