Extrapolate DataFrame rows

Question

Mar*_*man 1 interpolation scipy dataframe pandas extrapolation

I have a DataFrame like this:

import numpy as np
import pandas as pd

d = {'col1': [np.nan, np.nan, 1],
     'col2': [1, 1, 2],
     'col3': [2, 2, 3],
     'col4': [np.nan, 3, np.nan]}
df = pd.DataFrame(data=d)

and I would like to extrapolate along the rows to fill any trailing NaNs.

Expected output:

d2 = {'col1': [np.nan, np.nan, 1],
      'col2': [1, 1, 2],
      'col3': [2, 2, 3],
      'col4': [3, 3, 4]}
df2 = pd.DataFrame(data=d2)

Edit: the slope is different for every row. I tried df.interpolate(method='linear'), but that only gives me a flat trend for the trailing NaNs.

Answer 1

Mr.*_*. T 5

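pandas' DataFrame.interpolate is largely a wrapper around SciPy's interpolation functions and takes many keywords that let you tune the interpolation. You could use spline: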

d = {'col1': [np.nan, np.nan, 1, 5, 9, np.nan],
     'col2': [1, 1, 2, 5, 8, np.nan],
     'col3': [2, 2, 3, 4, 5, np.nan],
     'col4': [np.nan, 3, np.nan, 5, 6, np.nan]}
df = pd.DataFrame(data=d)

df = df.interpolate(method = "spline", order = 1, limit_direction = "both")
print(df)

Output:

   col1  col2  col3  col4
0  -7.0   1.0   2.0   2.0
1  -3.0   1.0   2.0   3.0
2   1.0   2.0   3.0   4.0
3   5.0   5.0   4.0   5.0
4   9.0   8.0   5.0   6.0
5  13.0   8.8   5.6   7.0

Edit:
There may be a more elegant solution in pandas, but this is one way to solve the problem:

d = {'col1 Mar': [np.nan, np.nan, 1],
     'col2 Jun': [1, 1, 2],
     'col3 Sep': [2, 2, 3],
     'col4 Dec': [np.nan, 3, np.nan]}
df = pd.DataFrame(data=d)
print(df)
#store temporarily the column index
col_index = df.columns
#transcribe month into a number that reflects the time distance
df.columns = [3, 6, 9, 12]

#interpolate over rows
df = df.interpolate(method = "spline", order = 1,  limit_direction = "both", axis = 1, downcast = "infer")
#assign back the original index
df.columns = col_index
print(df)

Output:

   col1 Mar   col2 Jun  col3 Sep  col4 Dec
0       NaN          1         2       NaN
1       NaN          1         2       3.0
2       1.0          2         3       NaN
   col1 Mar   col2 Jun  col3 Sep  col4 Dec
0         0          1         2         3
1         0          1         2         3
2         1          2         3         4
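
As a side note, the hard-coded [3, 6, 9, 12] could probably be derived from the column labels instead. This is only a sketch and assumes every label ends in a three-letter English month abbreviation, as in the example; the names labels, month_names and months are made up for illustration.

```python
import pandas as pd

# Column labels as in the example; the month abbreviation is the last token.
labels = pd.Index(['col1 Mar', 'col2 Jun', 'col3 Sep', 'col4 Dec'])

# Take the last whitespace-separated token of each label ("Mar", "Jun", ...).
month_names = [label.split()[-1] for label in labels]

# Parse the abbreviations with the %b directive and keep only the month number.
months = pd.to_datetime(month_names, format='%b').month.tolist()
print(months)  # [3, 6, 9, 12]
```

The resulting list could then be assigned to df.columns before interpolating, in place of the literal list.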

If you supply the column index as datetime objects, you can probably use it directly, but I'm not sure.

Edit 2: As expected, you can also interpolate with datetime objects as column names:

CSV file:

Mar 2014, Jun 2014, Sep 2014, Mar 2015
nan,        1,        2,      nan
nan,        1,        2,      4
1,          2,        3,      nan

Code:

#read CSV file (a regex separator requires the python parser engine)
df = pd.read_csv("test.txt", sep = r',\s*', engine = "python")
#convert column names to datetime objects
df.columns = pd.to_datetime(df.columns)
#interpolate over rows
df = df.interpolate(method = "spline", order = 1,  limit_direction = "both", axis = 1, downcast = "infer")
print(df)

Output:

   2014-03-01  2014-06-01  2014-09-01  2015-03-01
0    0.000000         1.0         2.0    3.967391
1   -0.016457         1.0         2.0    4.000000
2    1.000000         2.0         3.0    4.967391

The results are no longer integers, because the intervals between the column dates span different numbers of days.
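
If only the trailing NaNs of each row should be filled, and each row should keep its own slope, a plain NumPy approach may be easier to reason about than the spline fit. The following is a minimal sketch, not a pandas built-in: the helper extrapolate_trailing is hypothetical, it assumes equally spaced columns (x = 0, 1, 2, ...), and it extends the straight line through the last two valid values of each row. Leading NaNs are left untouched.

```python
import numpy as np
import pandas as pd


def extrapolate_trailing(row, x):
    """Fill trailing NaNs of one row by extending the line through its last two valid points."""
    y = row.to_numpy(dtype=float, copy=True)
    valid = np.flatnonzero(~np.isnan(y))
    if valid.size < 2:
        return row  # fewer than two points: no slope can be defined
    i0, i1 = valid[-2], valid[-1]
    slope = (y[i1] - y[i0]) / (x[i1] - x[i0])
    tail = np.arange(i1 + 1, y.size)  # positions after the last valid value
    y[tail] = y[i1] + slope * (x[tail] - x[i1])
    return pd.Series(y, index=row.index)


df = pd.DataFrame({'col1': [np.nan, np.nan, 1],
                   'col2': [1, 1, 2],
                   'col3': [2, 2, 3],
                   'col4': [np.nan, 3, np.nan]})

# Assumption: columns are equally spaced; pass real distances here otherwise.
x = np.arange(df.shape[1], dtype=float)
result = df.apply(extrapolate_trailing, axis=1, x=x)
print(result)
```

For unevenly spaced columns, x could instead hold the month numbers or the day offsets between the column dates, so the slope is computed against the real distances.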
