Pandas: new columns based on missing values and column names

Doc*_*EXE 2 python dataframe pandas

Suppose we have the following df:

+---+---------+---------+--------+-------+
|   |  2016   |  2017   |  2018  | 2019  |
+---+---------+---------+--------+-------+
| 0 | 26560.0 | 26810.0 | NaN    | NaN   |
| 1 |   570.0 | NaN     | 550.0  | 540.0 |
| 2 |  3770.0 | 3450.0  | 3210.0 | NaN   |
| 3 |  4320.0 | NaN     | NaN    | NaN   |
+---+---------+---------+--------+-------+

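I want to add two extra columns, "value" and "year". The "value" column should hold the value from the most recent year that is not missing, and the "year" column should hold that year: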

+---+---------+---------+--------+-------+---------+------+
|   |  2016   |  2017   |  2018  | 2019  |  value  | year |
+---+---------+---------+--------+-------+---------+------+
| 0 | 26560.0 | 26810.0 | NaN    | NaN   | 26810.0 | 2017 |
| 1 |   570.0 | NaN     | 550.0  | 540.0 |   540.0 | 2019 |
| 2 |  3770.0 | 3450.0  | 3210.0 | NaN   |  3210.0 | 2018 |
| 3 |  4320.0 | NaN     | NaN    | NaN   |  4320.0 | 2016 |
+---+---------+---------+--------+-------+---------+------+

Could you help me with this? Thanks!

jez*_*ael 6

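Use DataFrame.assign to add the new columns. For the first, forward fill the missing values along each row and select the last column by position. For the second, get the label of the last non-missing value with DataFrame.idxmax; because idxmax returns the first match, the column order has to be reversed by indexing: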

df1 = df.assign(value = df.ffill(axis=1).iloc[:, -1],
                year = df.notna().iloc[:, ::-1].idxmax(axis=1))
print (df1)
      2016     2017    2018   2019    value  year
0  26560.0  26810.0     NaN    NaN  26810.0  2017
1    570.0      NaN   550.0  540.0    540.0  2019
2   3770.0   3450.0  3210.0    NaN   3210.0  2018
3   4320.0      NaN     NaN    NaN   4320.0  2016
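
To make the two steps easier to follow, here is a minimal self-contained sketch that rebuilds the sample data and splits the one-liner into named intermediates. The string column labels and the names `filled`, `value` and `year` are assumptions for illustration; the question does not say whether the labels are strings or integers.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; string column labels are an assumption.
df = pd.DataFrame(
    {
        "2016": [26560.0, 570.0, 3770.0, 4320.0],
        "2017": [26810.0, np.nan, 3450.0, np.nan],
        "2018": [np.nan, 550.0, 3210.0, np.nan],
        "2019": [np.nan, 540.0, np.nan, np.nan],
    }
)

# Step 1: carry each row's last known value rightwards,
# then read the last column by position.
filled = df.ffill(axis=1)
value = filled.iloc[:, -1]

# Step 2: idxmax returns the label of the first True in each row, so the
# columns are reversed to make the most recent non-missing year come first.
year = df.notna().iloc[:, ::-1].idxmax(axis=1)

df1 = df.assign(value=value, year=year)
print(df1)
```

Note that the forward fill also overwrites gaps in the middle of a row (for example, row 1 gets 570.0 in 2017 inside `filled`), which is harmless here because only the last column is read.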

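The solution above works only if every row contains at least one non-missing value. A more general solution uses numpy.where to return a missing value for rows where no value exists: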

print (df)
      2016     2017    2018   2019
0  26560.0  26810.0     NaN    NaN
1    570.0      NaN   550.0  540.0
2   3770.0   3450.0  3210.0    NaN
3      NaN      NaN     NaN    NaN

import numpy as np

mask = df.notna()
df2 = df.assign(value = df.ffill(axis=1).iloc[:, -1],
               year = np.where(mask.any(axis=1), mask.iloc[:, ::-1].idxmax(axis=1), np.nan))
print (df2)
      2016     2017    2018   2019    value  year
0  26560.0  26810.0     NaN    NaN  26810.0  2017
1    570.0      NaN   550.0  540.0    540.0  2019
2   3770.0   3450.0  3210.0    NaN   3210.0  2018
3      NaN      NaN     NaN    NaN      NaN   NaN
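
The reason the guard is needed can be shown with a short sketch: on a row where every entry of the mask is False, idxmax still returns a label (the first one it sees), so the year would be wrong rather than missing. The names `unguarded` and `year` and the string column labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Sample data where the last row has no values at all
# (string column labels are an assumption).
df = pd.DataFrame(
    {
        "2016": [26560.0, 570.0, 3770.0, np.nan],
        "2017": [26810.0, np.nan, 3450.0, np.nan],
        "2018": [np.nan, 550.0, 3210.0, np.nan],
        "2019": [np.nan, 540.0, np.nan, np.nan],
    }
)

mask = df.notna()

# Without a guard, an all-False row yields the first label of the
# reversed columns instead of a missing value.
unguarded = mask.iloc[:, ::-1].idxmax(axis=1)

# Keep the label only for rows that have at least one non-missing value.
year = np.where(mask.any(axis=1), unguarded, np.nan)

df2 = df.assign(value=df.ffill(axis=1).iloc[:, -1], year=year)
print(df2)
```

Here `unguarded` reports "2019" for row 3 even though that row is empty, while the guarded `year` is NaN for it. The value column needs no guard, because forward filling a row with no values leaves it all NaN.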

Another idea uses DataFrame.stack with DataFrame.drop_duplicates, and also works if some row contains only missing values:

df2 = df.join(df.stack()
                .reset_index(name='value')
                .drop_duplicates('level_0', keep='last')
                .rename(columns={'level_1':'year'})
                .set_index('level_0')
                [['value','year']])
print (df2)
      2016     2017    2018   2019    value  year
0  26560.0  26810.0     NaN    NaN  26810.0  2017
1    570.0      NaN   550.0  540.0    540.0  2019
2   3770.0   3450.0  3210.0    NaN   3210.0  2018
3   4320.0      NaN     NaN    NaN   4320.0  2016

For the DataFrame whose last row contains only missing values, the same code returns:

df2 = df.join(df.stack()
                .reset_index(name='value')
                .drop_duplicates('level_0', keep='last')
                .rename(columns={'level_1':'year'})
                .set_index('level_0')
                [['value','year']])
print (df2)
      2016     2017    2018   2019    value  year
0  26560.0  26810.0     NaN    NaN  26810.0  2017
1    570.0      NaN   550.0  540.0    540.0  2019
2   3770.0   3450.0  3210.0    NaN   3210.0  2018
3      NaN      NaN     NaN    NaN      NaN   NaN
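
As a self-contained sketch of the stack approach, the steps can be split up as below. One deliberate difference from the code above: an explicit `.dropna()` is added after `stack()`, on the assumption that some pandas versions keep missing values when stacking; where `stack()` already drops them, the extra call changes nothing. The names `long_form` and `last_per_row` and the string column labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data where the last row has no values at all
# (string column labels are an assumption).
df = pd.DataFrame(
    {
        "2016": [26560.0, 570.0, 3770.0, np.nan],
        "2017": [26810.0, np.nan, 3450.0, np.nan],
        "2018": [np.nan, 550.0, 3210.0, np.nan],
        "2019": [np.nan, 540.0, np.nan, np.nan],
    }
)

# Long format: one record per (row label, year) pair that has a value.
# The explicit dropna() keeps this independent of stack()'s own NaN handling.
long_form = df.stack().dropna().reset_index(name="value")

# Records are ordered by row, then by column, so the last record per
# original row is the most recent non-missing year.
last_per_row = (
    long_form.drop_duplicates("level_0", keep="last")
    .rename(columns={"level_1": "year"})
    .set_index("level_0")[["value", "year"]]
)

# Rows with no values have no record, so the join leaves them as NaN.
df2 = df.join(last_per_row)
print(df2)
```

The `level_0` and `level_1` names come from reset_index when the index and the columns are unnamed; with named axes those labels would differ.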