Pandas apply, rolling, groupby with multiple input and multiple output columns

Question

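For the past week I have been struggling to use apply to run functions over an entire Pandas DataFrame, including rolling windows, groupby, and especially multiple input columns and multiple output columns. I found a large number of questions on SO about this topic, along with many old and outdated answers. So I started building a notebook covering every possible combination of x inputs and outputs, rolling, and rolling combined with groupby, with a focus on performance as well. Since I am not the only one struggling with these problems, I thought I would share my solutions with working examples here, in the hope that they help existing and future Pandas users.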

Answer 1

Bob*_*aaf 5

Important notes

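1. The combination of apply and rolling in pandas has a very strict output requirement. You have to return one single value. You cannot return a pd.Series, a list, an array, or an array secretly wrapped inside another array; only a single value such as an integer. This requirement makes it hard to get a working solution when trying to return multiple outputs for multiple columns. I don't understand why "apply and rolling" has this requirement, because "apply" without rolling does not. It must be due to some internal Pandas functionality.
2. The combination of "apply and rolling" with multiple input columns simply does not work! Imagine a DataFrame with 2 columns and 6 rows, and you want to apply a custom function with a rolling window of 2. Your function should get an input array of 2x2 values: 2 values per column, for 2 rows. But it seems Pandas cannot handle rolling and multiple input columns at the same time. I tried using the axis parameter to get it to work, but:
   - Axis = 0 calls your function per column. For the DataFrame described above it calls your function 10 times (not 12, since rolling = 2), and because it is per column, it only provides the 2 rolling values of that column...
   - Axis = 1 calls your function per row. This is probably what you want, but Pandas will not provide a 2x2 input. It actually ignores the rolling completely and only provides the values of one row across the 2 columns...
3. When using "apply" with multiple input columns, you can provide a parameter called raw (boolean). It is False by default, which means the input will be a pd.Series and therefore includes the index alongside the values. If you do not need the index, you can set raw to True to get a Numpy array, which usually performs better.
4. When combining "rolling & groupby", it returns a multi-indexed Series, which cannot easily be used as input for a new column. The easiest solution is to append reset_index(drop=True), as answered and commented here (Python - rolling functions for GroupBy object).
5. You might ask when you would ever want to use a rolling, groupby custom function with multiple outputs. Answer: I recently had to run a Fourier transform with a sliding window (rolling) over different batches (groupby) of a dataset of 5 million records (so speed/performance matters). And I needed to save both the power and the phase of the Fourier transform in separate columns (multiple outputs). Most people probably only need some of the basic examples below, but I believe that, especially in machine learning / data science, the more complex examples can be useful.
6. If you have a better, clearer, or faster way to do any of the solutions below, please let me know. I will update my answer and we can all benefit!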
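Points 1 to 3 can be checked with a small sketch. It assumes the 6-row, 2-column DataFrame described in point 2; the frame `frame`, the `record` function, and the `seen` list are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 6-row, 2-column frame, as described in point 2
frame = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                      'b': [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]})

seen = []  # one (type name, shape) entry per call of the custom function

def record(x):
    seen.append((type(x).__name__, x.shape))
    return 0.0  # rolling().apply() needs exactly one number back

# raw=True: each call receives a NumPy array
frame.rolling(2).apply(record, raw=True)
calls_raw = list(seen)

# raw=False: each call receives a pd.Series (values plus index)
seen.clear()
frame.rolling(2).apply(record, raw=False)
calls_series = list(seen)

print(len(calls_raw))     # number of calls: one per full window per column
print(set(calls_raw))     # input type and shape with raw=True
print(set(calls_series))  # input type and shape with raw=False
```

In both cases the function only ever sees the values of one column at a time, which is why the multi-input examples further down need a workaround.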

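Code examples

Let's first create a DataFrame that will be used in all examples below, including a group column for the groupby examples. For the rolling window and the number of input/output columns I use just 2 in all code examples below, but obviously this could be any number greater than 1.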

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randint(0, 5, size=(6, 2)), columns=list('ab'))
    df['group'] = [0, 0, 0, 1, 1, 1]
    df = df[['group', 'a', 'b']]
It looks like this:

       group  a  b
    0      0  2  2
    1      0  4  1
    2      0  0  4
    3      1  0  2
    4      1  3  2
    5      1  3  0

Input 1 column, output 1 column

Basic

    def func_i1_o1(x):
        return x + 1

    df['c'] = df['b'].apply(func_i1_o1)

Rolling

    def func_i1_o1_rolling(x):
        # With raw=True, x is a NumPy array holding the 2 values of the window
        return x[0] + x[1]

    df['d'] = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)

Rolling and groupby

Add the reset_index solution (see the notes above) to the rolling function.

    df['e'] = df.groupby('group')['c'].rolling(2).apply(func_i1_o1_rolling, raw=True).reset_index(drop=True)
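Note that reset_index(drop=True) relies on the rows already being sorted by group, as they are in this example. A minimal sketch of an alternative, assuming a pandas version where groupby().rolling() returns a MultiIndex of (group key, original index): dropping only the group level keeps the original row labels, so the assignment aligns on the index even when the groups are interleaved. The frame `mixed` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame in which the two groups are interleaved
mixed = pd.DataFrame({'group': [0, 1, 0, 1, 0, 1],
                      'c': [1, 2, 3, 4, 5, 6]})

rolled = mixed.groupby('group')['c'].rolling(2).apply(np.sum, raw=True)

# Drop only the group level; the remaining level is the original row index,
# so the column assignment aligns each value with its own row
mixed['e'] = rolled.reset_index(level=0, drop=True)
print(mixed)
```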

Input 2 columns, output 1 column

Basic

    def func_i2_o1(x):
        return np.sum(x)

    df['f'] = df[['b', 'c']].apply(func_i2_o1, axis=1, raw=True)

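Rolling

As explained in point 2 of the notes above, there is no "normal" solution for 2 inputs. The workaround below uses raw=False to ensure the input is a pd.Series, which means we also get the index alongside the values. This lets us fetch the values of other columns at the correct index.

    def func_i2_o1_rolling(x):
        values_b = x
        # Use the window's index to look up the matching rows of column 'c'
        values_c = df.loc[x.index, 'c'].to_numpy()
        return np.sum(values_b) + np.sum(values_c)

    df['g'] = df['b'].rolling(2).apply(func_i2_o1_rolling, raw=False)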
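As an alternative sketch, not part of the original answer: if the function can work on plain arrays, NumPy's sliding_window_view (NumPy 1.20 or later) gives true 2x2 windows over both columns at once, without looking up the other column through the index. The frame `demo` and its values are made up for illustration; the NaN prefix mirrors what rolling(2) produces for the first row.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Made-up frame with the two input columns
demo = pd.DataFrame({'b': [2, 1, 4, 2, 2, 0],
                     'c': [3, 2, 5, 3, 3, 1]})
window = 2

# Shape (n_rows - window + 1, n_columns, window): one 2x2 block per window
windows = sliding_window_view(demo[['b', 'c']].to_numpy(), window, axis=0)

# Same computation as func_i2_o1_rolling: sum of 'b' plus sum of 'c' per window
sums = windows.sum(axis=(1, 2)).astype(float)

# Pad the first (window - 1) rows with NaN, as rolling() would
demo['g'] = np.concatenate([np.full(window - 1, np.nan), sums])
print(demo)
```

This only fits functions that can be expressed on arrays; when the function needs the index, the raw=False workaround above still applies.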

Rolling and groupby

Add the reset_index solution (see the notes above) to the rolling function.

    df['h'] = df.groupby('group')['b'].rolling(2).apply(func_i2_o1_rolling, raw=False).reset_index(drop=True)

Input 1 column, output 2 columns

Basic

You can use the "normal" solution by returning a pd.Series:

    def func_i1_o2(x):
        return pd.Series((x + 1, x + 2))

    df[['i', 'j']] = df['b'].apply(func_i1_o2)
Or you can use the zip/tuple combination, which is about 8 times faster!

    def func_i1_o2_fast(x):
        return x + 1, x + 2

    df['k'], df['l'] = zip(*df['b'].apply(func_i1_o2_fast))
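The speed difference can be measured with a rough sketch like the one below. The 8x figure is the answer's own; the ratio will vary with the machine, the pandas version, and the data size. The frame size and the function names here are arbitrary choices for illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up frame, larger than the 6-row example so the timing is measurable
demo = pd.DataFrame({'b': np.random.randint(0, 5, size=1000)})

def via_series(x):
    return pd.Series((x + 1, x + 2))

def via_tuple(x):
    return x + 1, x + 2

def run_series():
    # Returns a DataFrame with columns 0 and 1
    return demo['b'].apply(via_series)

def run_tuple():
    # Returns a list of two tuples, one per output column
    return list(zip(*demo['b'].apply(via_tuple)))

t_series = timeit.timeit(run_series, number=3)
t_tuple = timeit.timeit(run_tuple, number=3)
print(f"pd.Series: {t_series:.3f}s, zip/tuple: {t_tuple:.3f}s")
```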

Rolling

As explained in point 1 of the notes above, we need a workaround if we want to return more than 1 value when combining rolling and apply. I found 2 working solutions.

1

    def func_i1_o2_rolling_solution1(x):
        output_1 = np.max(x)
        output_2 = np.min(x)
        # Last index is where to place the final values: x.index[-1]
        # .loc is used because .at only addresses a single scalar cell
        df.loc[x.index[-1], ['m', 'n']] = output_1, output_2
        return 0

    df['m'], df['n'] = (np.nan, np.nan)
    df['b'].rolling(2).apply(func_i1_o2_rolling_solution1, raw=False)
Pro: everything is done inside 1 function.
Con: you have to create the columns first, and it is slower because it does not use raw input.

2

    rolling_w = 2
    nan_prefix = (rolling_w - 1) * [np.nan]
    output_list_1 = nan_prefix.copy()
    output_list_2 = nan_prefix.copy()

    def func_i1_o2_rolling_solution2(x):
        output_list_1.append(np.max(x))
        output_list_2.append(np.min(x))
        return 0

    df['b'].rolling(rolling_w).apply(func_i1_o2_rolling_solution2, raw=True)
    df['o'] = output_list_1
    df['p'] = output_list_2
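Pro: it uses raw input, which makes it about twice as fast. And since it does not use the index to set the output values, the code looks a bit cleaner (to me at least).
Con: you have to create the NaN prefix yourself, and it takes more lines of code.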
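The second solution can be wrapped in a function so the output lists are not module-level globals. This is a minimal sketch, not the author's code: `rolling_multi_output` is a hypothetical helper, and it assumes the Series contains no NaN values, so that the custom function is called once for every full window.

```python
import numpy as np
import pandas as pd

def rolling_multi_output(series, window, func, n_outputs):
    """Hypothetical helper: collect n_outputs values per full window.

    Assumes the series has no NaN values, so func runs for every full window.
    """
    # One output list per returned value, pre-filled with the NaN prefix
    outputs = [[np.nan] * (window - 1) for _ in range(n_outputs)]

    def collector(x):
        for out, value in zip(outputs, func(x)):
            out.append(value)
        return 0.0  # dummy value; rolling().apply() needs a single number

    series.rolling(window).apply(collector, raw=True)
    return outputs

# Made-up data for illustration
demo = pd.DataFrame({'b': [2, 1, 4, 2, 2, 0]})
demo['o'], demo['p'] = rolling_multi_output(
    demo['b'], 2, lambda x: (np.max(x), np.min(x)), n_outputs=2)
print(demo)
```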

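Rolling and groupby

Normally I would use the faster second solution above. However, since we are combining groups and rolling, you would have to manually set NaNs/zeros (depending on the number of groups) at the right indices somewhere in the middle of the dataset. To me it seems that when combining rolling, groupby, and multiple output columns, the first solution is easier and handles the NaNs/grouping automatically. Once again, I use the reset_index solution at the end.

    def func_i1_o2_rolling_groupby(x):
        output_1 = np.max(x)
        output_2 = np.min(x)
        # Last index is where to place the final values: x.index[-1]
        # .loc is used because .at only addresses a single scalar cell
        df.loc[x.index[-1], ['q', 'r']] = output_1, output_2
        return 0

    df['q'], df['r'] = (np.nan, np.nan)
    df.groupby('group')['b'].rolling(2).apply(func_i1_o2_rolling_groupby, raw=False).reset_index(drop=True)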
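A variant sketch that avoids creating the columns up front and writing into the DataFrame from inside the function: collect the outputs in a dict keyed by the window's last index label, then join them back. Like the workaround above, it relies on raw=False passing the original row labels, and it assumes the rows are already sorted by group as in this example. The `results` dict, the `collect_max_min` function, and the frame `demo` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up frame with contiguous groups, like the example DataFrame
demo = pd.DataFrame({'group': [0, 0, 0, 1, 1, 1],
                     'b': [2, 1, 4, 2, 2, 0]})

results = {}  # last index label of each window -> (output_1, output_2)

def collect_max_min(x):
    results[x.index[-1]] = (np.max(x), np.min(x))
    return 0.0  # dummy value; rolling().apply() needs a single number

demo.groupby('group')['b'].rolling(2).apply(collect_max_min, raw=False)

# Rows without a full window inside their group are missing from the dict
# and therefore become NaN when the outputs are joined back
outputs = pd.DataFrame.from_dict(results, orient='index', columns=['q', 'r'])
demo = demo.join(outputs)
print(demo)
```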

Input 2 columns, output 2 columns

Basic

I suggest using the same "fast" way as for i1_o2, the only difference being that you get 2 input values to use.

    def func_i2_o2(x):
        return np.mean(x), np.median(x)

    df['s'], df['t'] = zip(*df[['b', 'c']].apply(func_i2_o2, axis=1))
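For DataFrame.apply with axis=1 there is also the built-in result_type='expand' option, which turns a returned tuple into columns. A minimal sketch on a made-up frame `demo`, shown next to the zip approach for comparison; it makes no claim about which of the two is faster.

```python
import numpy as np
import pandas as pd

# Made-up frame with the two input columns
demo = pd.DataFrame({'b': [2, 1, 4], 'c': [3, 2, 5]})

def func_i2_o2(x):
    return np.mean(x), np.median(x)

# zip/tuple approach from the answer
demo['s'], demo['t'] = zip(*demo[['b', 'c']].apply(func_i2_o2, axis=1))

# result_type='expand' returns a DataFrame with one column per tuple element
expanded = demo[['b', 'c']].apply(func_i2_o2, axis=1, result_type='expand')
expanded.columns = ['s2', 't2']
demo = demo.join(expanded)
print(demo)
```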

Rolling

Since I use one workaround to apply rolling with multiple inputs and another workaround for rolling with multiple outputs, you can probably guess that I need to combine them.
1. Use the index to get the values from other columns (see func_i2_o1_rolling)
2. Set the final multiple outputs on the correct index (see func_i1_o2_rolling_solution1)

    def func_i2_o2_rolling(x):
        values_b = x.to_numpy()
        values_c = df.loc[x.index, 'c'].to_numpy()
        output_1 = np.min([np.sum(values_b), np.sum(values_c)])
        output_2 = np.max([np.sum(values_b), np.sum(values_c)])
        # Last index is where to place the final values: x.index[-1]
        # .loc is used because .at only addresses a single scalar cell
        df.loc[x.index[-1], ['u', 'v']] = output_1, output_2
        return 0

    df['u'], df['v'] = (np.nan, np.nan)
    df['b'].rolling(2).apply(func_i2_o2_rolling, raw=False)

Rolling and groupby

Add the reset_index solution (see the notes above) to the rolling function.

    def func_i2_o2_rolling_groupby(x):
        values_b = x.to_numpy()
        values_c = df.loc[x.index, 'c'].to_numpy()
        output_1 = np.min([np.sum(values_b), np.sum(values_c)])
        output_2 = np.max([np.sum(values_b), np.sum(values_c)])
        # Last index is where to place the final values: x.index[-1]
        # .loc is used because .at only addresses a single scalar cell
        df.loc[x.index[-1], ['w', 'x']] = output_1, output_2
        return 0

    df['w'], df['x'] = (np.nan, np.nan)
    df.groupby('group')['b'].rolling(2).apply(func_i2_o2_rolling_groupby, raw=False).reset_index(drop=True)
