I have a dataframe that looks like this:
>>> df
                         value
time
2020-01-31 07:59:43.232     -6
2020-01-31 07:59:43.232     -2
2020-01-31 07:59:43.232     -1
2020-01-31 07:59:43.264      1
2020-01-31 07:59:43.389      0
2020-01-31 07:59:43.466      1
2020-01-31 07:59:43.466      5
2020-01-31 07:59:43.466     -1
2020-01-31 07:59:43.467     -1
2020-01-31 07:59:43.467     -1
2020-01-31 07:59:43.467      5
2020-01-31 07:59:43.467      1
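I want to add 3 more columns that show the ratio of positive to negative values for every group of a given size. For example, if the number is 8: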
                         value  neg  pos  total
time
2020-01-31 07:59:43.232     -6
2020-01-31 07:59:43.232     -2    8    0      8
2020-01-31 07:59:43.232     -1
2020-01-31 07:59:43.264      1
2020-01-31 07:59:43.389      0
2020-01-31 07:59:43.466      1
2020-01-31 07:59:43.466      5    1    7      8
2020-01-31 07:59:43.466     -1
2020-01-31 07:59:43.467     -1
2020-01-31 07:59:43.467     -1
2020-01-31 07:59:43.467      5    3    5      8
2020-01-31 07:59:43.467      1
If the number is 5:
                         value  neg  pos  total
time
2020-01-31 07:59:43.232     -6    5    0      5  # take just 5 out of -6 and the rest(-1) is used for the next calculation
2020-01-31 07:59:43.232     -2
2020-01-31 07:59:43.232     -1
2020-01-31 07:59:43.264      1    4    1      5  # sum(abs(list(-1, -2, -1, 1)))
2020-01-31 07:59:43.389      0
2020-01-31 07:59:43.466      1
2020-01-31 07:59:43.466      5    0    5      5  # 1 + 5 -> take just 5(1, 4) out of them and the rest(1) is used for the next calculation
2020-01-31 07:59:43.466     -1
2020-01-31 07:59:43.467     -1
2020-01-31 07:59:43.467     -1
2020-01-31 07:59:43.467      5    3    4      5  # 1, -1, -1, -1, 5 -> take just 5(1, -1, -1, -1, 1) out of them and the rest(4) is used for the next calculation
2020-01-31 07:59:43.467      1    0    5      5  # 4, 1
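I have been doing the calculation with a loop and several conditional statements, and it is slow. I would like to know whether there is a more efficient, faster way to do this.
The code below shows how I do it when the number is 300 (GROUP_SIZE):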
GROUP_SIZE = 300
for DATE in lst_requiredDates:
    df = dic_dtf[DATE]
    lst_groups = []
    lst_group = [0, 0, 0, 0]
    for index, row in df.iterrows():
        date = index
        value = row['value']
        abs_value = abs(value)
        if (lst_group[3]+abs_value) < GROUP_SIZE:
            if value < 0:
                lst_group[0] = date
                lst_group[1] += abs_value
                lst_group[3] += abs_value
            else:
                lst_group[0] = date
                lst_group[2] += abs_value
                lst_group[3] += abs_value
        elif (lst_group[3]+abs_value) == GROUP_SIZE:
            if value < 0:
                lst_group[0] = date
                lst_group[1] += abs_value
                lst_group[3] += abs_value
            else:
                lst_group[0] = date
                lst_group[2] += abs_value
                lst_group[3] += abs_value
            lst_groups.append(lst_group)
            lst_group = [0, 0, 0, 0]
        elif (lst_group[3]+abs_value) > GROUP_SIZE:
            int_left = (lst_group[3]+abs_value) - GROUP_SIZE
            if value < 0:
                lst_group[0] = date
                lst_group[1] += (abs_value - int_left)
                lst_group[3] += (abs_value - int_left)
                lst_groups.append(lst_group)
                lst_group = [0, 0, 0, 0]
                lst_group[0] = date
                lst_group[1] += int_left
                lst_group[3] += int_left
            else:
                lst_group[0] = date
                lst_group[2] += (abs_value - int_left)
                lst_group[3] += (abs_value - int_left)
                lst_groups.append(lst_group)
                lst_group = [0, 0, 0, 0]
                lst_group[0] = date
                lst_group[2] += int_left
                lst_group[3] += int_left
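Here is a solution that uses operations on the whole DataFrame at once, which should be very efficient.
I use cumsum() twice: once on the absolute values, to find when the group size is reached, and once on the values themselves, which we can use later to find neg and pos.
A shift() is then used to find the group boundaries. Those boundary rows are the ones we want to update, and they hold all the data needed to compute the sums.
Handling the remainder is not too hard: take the cumulative sum of absolute values modulo the group size and give it the sign of the last value.
This is where the cumulative sum of signed values comes in handy. After adjusting it by the remainder, we can find the signed sum of the current group (pos - neg) as the difference from the previous boundary row.
Knowing that pos and neg add up to the group size, we can easily compute the two individual values and add them to the DataFrame.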
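The algebra behind that last step is short. As a sketch, write $G$ for the group size and $s$ for the signed sum of one group (these symbols are chosen here for illustration; in the code they correspond to group_size and grpsum):

```latex
\begin{aligned}
\mathrm{pos} + \mathrm{neg} &= G, \\
\mathrm{pos} - \mathrm{neg} &= s, \\
\Rightarrow\quad \mathrm{pos} &= \frac{G + s}{2}, \qquad \mathrm{neg} = \frac{G - s}{2}.
\end{aligned}
```

For instance, with $G = 5$ and $s = -3$ this gives neg = 4 and pos = 1, which matches the second group in the group-size-5 example from the question.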
The code follows, with comments further explaining all of this:

import pandas as pd
import numpy as np

def get_pos_neg_ratio(series, group_size):
    df = series.rename('value').to_frame()

    # Calculate the cumulative sum and the cumulative sum
    # of absolute values. The latter will be used to break
    # the series into groups.
    df_aux = df.copy()
    df_aux['cumsum'] = df['value'].cumsum()
    df_aux['cumabs'] = abs(df['value']).cumsum()
    df_aux['group'] = df_aux['cumabs'] // group_size

    # Break it into groups, by locating the boundaries.
    df_aux = df_aux[
        df_aux['group'] != df_aux['group'].shift(fill_value=0)
    ].copy()

    # Calculate the remainder on each boundary row. Give
    # it the sign of the value in that row, since that
    # value is the one that got it over the group size.
    df_aux['remainder'] = (
        (df_aux['cumabs'] % group_size) *
        np.sign(df_aux['value'])
    )

    # Adjust the sums by the remainder.
    df_aux['adjsum'] = df_aux['cumsum'] - df_aux['remainder']

    # Finally, find the individual sums by subtracting
    # the adjusted cumulative sum of the previous
    # group. This will be the total sum of positives and
    # negatives for this group.
    df_aux['grpsum'] = (
        df_aux['adjsum'] -
        df_aux['adjsum'].shift(fill_value=0)
    )

    # Now we can calculate positives and negatives. We
    # know that their absolute values sum up to group_size
    # and that they sum up to `grpsum`, so a little bit of
    # algebra will get us to:
    df['neg'] = (group_size - df_aux['grpsum']) // 2
    df['pos'] = (group_size + df_aux['grpsum']) // 2
    df['total'] = df['neg'] + df['pos']
    return df

Pass the function a Series (i.e. a column) and a group size, and it returns a DataFrame containing that column (under the name value) plus the computed neg, pos and total columns.
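Also note that this function requires an index without duplicates! Otherwise the final assignments will fail. I suggest calling reset_index() first, which turns time into a regular column, and then possibly set_index() afterwards to restore it.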
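The index round trip could look like the sketch below. The three-row frame and the abs_value column are made up for illustration; abs_value is only a stand-in for the columns that the real computation would add.

```python
import pandas as pd

# A tiny frame with a duplicated 'time' index, like the one in the question.
times = pd.to_datetime([
    '2020-01-31 07:59:43.232',
    '2020-01-31 07:59:43.232',
    '2020-01-31 07:59:43.264',
])
df = pd.DataFrame({'value': [-6, -2, 1]}, index=times)
df.index.name = 'time'
print(df.index.is_unique)    # False: two rows share a timestamp

# Move 'time' into a regular column; the new index is a unique RangeIndex.
flat = df.reset_index()
print(flat.index.is_unique)  # True

# Stand-in for the real computation on flat['value'] (hypothetical column).
flat['abs_value'] = flat['value'].abs()

# Put the timestamps back as the index once the new columns are attached.
restored = flat.set_index('time')
print(restored)
```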
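This code will break if a single value takes us past two group boundaries at once, so for the sample data it breaks for group_size ≤ 4. It could probably be fixed for that case (we can detect that a group was skipped at a boundary), but it is not clear how those cases should be handled: should we insert a new row with NaN values and a duplicated index for the extra group?
Since you did not mention this case in the examples you provided, and your sample code uses a large group size of 300, I assume it is not something you are very concerned about and that the current approach is adequate.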
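Detecting the problematic rows up front is cheap. The helper below is a hypothetical addition (find_multi_boundary_rows is not part of the solution above): it reuses the same cumabs // group_size idea and flags any row where the group number jumps by more than one.

```python
import pandas as pd

def find_multi_boundary_rows(series, group_size):
    """Return the rows whose value crosses more than one group boundary."""
    cumabs = series.abs().cumsum()
    group = cumabs // group_size
    # The jump in the group number is the number of boundaries crossed.
    crossed = group - group.shift(fill_value=0)
    return series[crossed > 1]

values = pd.Series([-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1])

# With a group size of 5 no row is flagged, so the vectorized code is safe.
print(find_multi_boundary_rows(values, 5))

# With a group size of 4 the two rows holding 5 each cross two boundaries.
print(find_multi_boundary_rows(values, 4))
```

Running this check before calling the main function lets you fail loudly, or fall back to a loop, instead of silently getting wrong groups.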
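Another point to consider is that we do not keep the running sums of the last, incomplete group. If we are streaming data or concatenating DataFrames, we have no way of knowing how much was left over, which we would need in order to continue the calculation.
Again, since in your example with a group size of 8 you did not mention the leftover 1 from the last row, I believe this is not really a concern for you either.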
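If the leftover does matter, it can be recovered from the same cumulative sums. The function below is a sketch under assumptions, not part of the solution above: trailing_partial_group is a hypothetical name, and it assumes a unique index and integer values.

```python
import numpy as np
import pandas as pd

def trailing_partial_group(series, group_size):
    """Return neg/pos/total of the incomplete group at the end of the series."""
    cumsum = series.cumsum()
    cumabs = series.abs().cumsum()
    group = cumabs // group_size
    boundary = group != group.shift(fill_value=0)

    # Absolute amount that has not been assigned to a complete group yet.
    total = int(cumabs.iloc[-1] % group_size)

    if boundary.any():
        # Signed cumulative sum up to the end of the last complete group.
        last = boundary[boundary].index[-1]
        remainder = (cumabs.loc[last] % group_size) * np.sign(series.loc[last])
        closed_sum = cumsum.loc[last] - remainder
    else:
        closed_sum = 0

    # Signed sum of the leftover, i.e. pos - neg of the incomplete group.
    signed = int(cumsum.iloc[-1] - closed_sum)
    return {'neg': (total - signed) // 2,
            'pos': (total + signed) // 2,
            'total': total}

values = pd.Series([-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1])
print(trailing_partial_group(values, 8))  # the leftover 1 from the last row
print(trailing_partial_group(values, 7))
```

Carrying this state into the next chunk would still need extra handling that is not shown here.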
Sample run on your data (after resetting the index), with a group size of 5:

>>> df = df.reset_index()
>>> print(get_pos_neg_ratio(df['value'], 5))
    value  neg  pos  total
0      -6  5.0  0.0    5.0
1      -2  NaN  NaN    NaN
2      -1  NaN  NaN    NaN
3       1  4.0  1.0    5.0
4       0  NaN  NaN    NaN
5       1  NaN  NaN    NaN
6       5  0.0  5.0    5.0
7      -1  NaN  NaN    NaN
8      -1  NaN  NaN    NaN
9      -1  NaN  NaN    NaN
10      5  3.0  2.0    5.0
11      1  0.0  5.0    5.0

(In the question, you listed pos on row 10 as 4, but it should actually be 2.)

Group size of 8:

>>> print(get_pos_neg_ratio(df['value'], 8))
    value  neg  pos  total
0      -6  NaN  NaN    NaN
1      -2  8.0  0.0    8.0
2      -1  NaN  NaN    NaN
3       1  NaN  NaN    NaN
4       0  NaN  NaN    NaN
5       1  NaN  NaN    NaN
6       5  1.0  7.0    8.0
7      -1  NaN  NaN    NaN
8      -1  NaN  NaN    NaN
9      -1  NaN  NaN    NaN
10      5  3.0  5.0    8.0
11      1  NaN  NaN    NaN

Group size of 7:

>>> print(get_pos_neg_ratio(df['value'], 7))
    value  neg  pos  total
0      -6  NaN  NaN    NaN
1      -2  7.0  0.0    7.0
2      -1  NaN  NaN    NaN
3       1  NaN  NaN    NaN
4       0  NaN  NaN    NaN
5       1  NaN  NaN    NaN
6       5  2.0  5.0    7.0
7      -1  NaN  NaN    NaN
8      -1  NaN  NaN    NaN
9      -1  NaN  NaN    NaN
10      5  3.0  4.0    7.0
11      1  NaN  NaN    NaN
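To cross-check runs like these, a plain loop is a handy reference. The function below is a hypothetical helper written for this comparison (it is not the vectorized solution). It fills one group at a time and records the row position where each group completes.

```python
def pos_neg_groups_reference(values, group_size):
    """Return (row_position, neg, pos) for every completed group."""
    groups = []
    neg = pos = 0
    for i, value in enumerate(values):
        remaining = abs(value)
        while remaining > 0:
            # Take only as much as still fits into the current group.
            take = min(remaining, group_size - (neg + pos))
            if value < 0:
                neg += take
            else:
                pos += take
            remaining -= take
            if neg + pos == group_size:
                groups.append((i, neg, pos))
                neg = pos = 0
    return groups

values = [-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1]
print(pos_neg_groups_reference(values, 5))
# [(0, 5, 0), (3, 4, 1), (6, 0, 5), (10, 3, 2), (11, 0, 5)]
print(pos_neg_groups_reference(values, 8))
# [(1, 8, 0), (6, 1, 7), (10, 3, 5)]
```

The positions and counts agree with the non-NaN rows in the sample runs, including pos = 2 on row 10 for a group size of 5. Unlike the vectorized version, this loop emits several groups for one row when a single value crosses more than one boundary, at the cost of being much slower on large inputs.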