Pandas: fill only the gaps, not the NaNs at the ends

Question

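I have some house price data spanning about 8 months that tracks each house's price from listing until sale. There are a few gaps in the middle of the data that I would like to fill, but I want to leave the NaNs at each end untouched.

As a simple example, say house1 goes on the market on "day 4" at 200000 and sells on "day 9" for 190000, while house2 stays at 180000 from day 1 through day 12 and does not sell within that window. However, something went wrong on days 6 and 7 and I lost the data: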

house1 = [NaN, NaN, NaN, 200000, 200000, NaN, NaN, 200000, 190000, NaN, NaN, NaN]
house2 = [180000, 180000, 180000, 180000, 180000, NaN, NaN, 180000, 180000, 180000, 180000, 180000]

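Now imagine that, instead of regular arrays, these are columns in a pandas DataFrame indexed by date.

The problem is that the function I would normally use to fill the gaps here is DataFrame.fillna() with either the backfill or the ffill method. If I use ffill, house1 comes back as: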

house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, 190000]

This fills the gap, but it also incorrectly fills in data after the day of sale. If I use backfill instead, I get this:

house1 = [200000, 200000, 200000, 200000, 200000, 200000, 200000, 200000, 190000, NaN, NaN, NaN]

Again, it fills the gap, but this time it also fills the front of the data. If I use 'limit=2' with ffill, then what I get is:

house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, NaN]

Once more it fills the gap, but it then also starts filling in data beyond where the "real" data ends.
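The three attempts described so far can be reproduced with a short sketch. The start date 2015-01-01 and the variable names are assumptions made for illustration (the question only says the columns are indexed by date), and `ffill()` / `bfill()` are used as the method forms of forward and backward filling.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hypothetical dates: the real index is not given in the question.
index = pd.date_range("2015-01-01", periods=12, freq="D")
df = pd.DataFrame(
    {
        "house1": [nan, nan, nan, 200000, 200000, nan, nan, 200000, 190000, nan, nan, nan],
        "house2": [180000] * 5 + [nan, nan] + [180000] * 5,
    },
    index=index,
)

forward = df["house1"].ffill()         # fills the gap, but also the tail
backward = df["house1"].bfill()        # fills the gap, but also the head
limited = df["house1"].ffill(limit=2)  # fills the gap and two days of the tail

# Count the NaNs left by each attempt
print(forward.isna().sum(), backward.isna().sum(), limited.isna().sum())
```

None of the three leaves exactly the six leading and trailing NaNs of house1 in place, which is the behaviour being asked for.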

My solution so far has been to write the following function:

import pandas as pd


def isANumber(value):
    """True for real values, False for NaN/None"""
    return pd.notna(value)


def fillGaps(houseDF):
    """Fills up holes in the housing data"""

    def fillColumns(column):
        # Work on a copy so the input column is left untouched
        filled_col = column.copy()
        lastValue = None
        lastIndex = None
        # Keeps track of if we are dealing with a gap in numbers
        gap = False
        i = 0
        for currentValue in column:
            # Loops over all the nans before the numbers begin
            if not isANumber(currentValue) and lastValue is None:
                pass
            # Keeps track of the last number we encountered before a gap
            elif isANumber(currentValue) and (gap is False):
                lastIndex = i
                lastValue = currentValue
            # Notes when we encounter a gap in numbers
            elif not isANumber(currentValue):
                gap = True
            # Fills in the gap
            elif isANumber(currentValue):
                # Positional slice, so this also works with a date index
                filled_col.iloc[lastIndex + 1:i] = lastValue
                gap = False
                # The current value is now the last number seen
                lastIndex = i
                lastValue = currentValue
            i += 1
        return filled_col

    filled_df = houseDF.apply(fillColumns, axis=0)
    return filled_df

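It simply skips all of the leading NaNs, fills in the gaps (defined as runs of NaNs between real values), and leaves the trailing NaNs unfilled.

Is there a cleaner way to do this, or a built-in pandas function that I am not aware of?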

Answer 1

小智 6

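I found this answer a year later but needed it to work on a DataFrame with multiple columns, so I wanted to leave my solution here in case anyone else needs the same thing. My function is just a modified version of YS-L's.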

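def fillna_downbet(df):
    df = df.copy()
    for col in df:
        non_nans = df[col].dropna()
        # Nothing to fill in a column that is entirely NaN
        if non_nans.empty:
            continue
        start, end = non_nans.index[0], non_nans.index[-1]
        # Forward-fill only between the first and last real values;
        # a single .loc call avoids chained assignment
        df.loc[start:end, col] = df.loc[start:end, col].ffill()
    return df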

Thanks!
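As a quick check of this approach on the data from the question, here is a minimal sketch. The function name `ffill_inside`, the extra all-NaN column `empty`, and the sample dates are hypothetical and not part of the answer; the sketch finds the bounds with `first_valid_index()` and `last_valid_index()` rather than by filtering out the NaNs.

```python
import numpy as np
import pandas as pd


def ffill_inside(df):
    """Forward-fill each column only between its first and last valid values."""
    out = df.copy()
    for col in out.columns:
        start = out[col].first_valid_index()
        end = out[col].last_valid_index()
        if start is None:  # column is entirely NaN: nothing to fill
            continue
        # Label-based slice, inclusive of both start and end
        out.loc[start:end, col] = out.loc[start:end, col].ffill()
    return out


nan = np.nan
df = pd.DataFrame(
    {
        "house1": [nan, nan, nan, 200000, 200000, nan, nan, 200000, 190000, nan, nan, nan],
        "house2": [180000] * 5 + [nan, nan] + [180000] * 5,
        "empty": [nan] * 12,  # hypothetical edge case: no valid values at all
    },
    index=pd.date_range("2015-01-01", periods=12, freq="D"),
)

filled = ffill_inside(df)
print(filled["house1"].tolist())
```

The leading and trailing NaNs of house1 stay in place, the two-day gap is filled with 200000, and the all-NaN column is skipped rather than raising an error.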

Answer 2

Jam*_*mes 5

Another solution for a DataFrame with multiple columns:

df.ffill() + (df.bfill() * 0)

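How does it work?

The first term forward-fills the values. This is almost what we want, except that it leaves a trail of filled values at the end of each series.

The second term back-fills the values, which we then multiply by zero. The result is that the trailing values we do not want are NaN and everything else is 0.

Finally, we add the two together, relying on the fact that x + 0 = x and x + NaN = NaN.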
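The intermediate terms can be inspected on a small series. This is only a sketch: the series `s` and the variable names are made up for illustration, and the `where`-based line at the end is an alternative formulation that is not part of the answer.

```python
import numpy as np
import pandas as pd

nan = np.nan
s = pd.Series([nan, 200000, nan, nan, 190000, nan])

forward = s.ffill()      # [nan, 200000, 200000, 200000, 190000, 190000]
mask = s.bfill() * 0     # [0, 0, 0, 0, 0, nan]
result = forward + mask  # [nan, 200000, 200000, 200000, 190000, nan]

# Alternative without arithmetic: keep the forward fill only where a
# backward fill would also find a value.
alt = s.ffill().where(s.bfill().notna())
print(result.equals(alt))
```

The leading NaN survives because the forward fill has nothing to propagate there, and the trailing NaN survives because the mask is NaN there. The arithmetic version assumes numeric columns with finite values, since an infinite value multiplied by 0 is NaN; the `where` form does not depend on that.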
