How do I build a DataFrame that shows the ratio of different types of values?

may*_*ull 5 python pandas

I have a DataFrame that looks like this:

>>> df
                        value
time
2020-01-31 07:59:43.232    -6
2020-01-31 07:59:43.232    -2
2020-01-31 07:59:43.232    -1
2020-01-31 07:59:43.264     1
2020-01-31 07:59:43.389     0
2020-01-31 07:59:43.466     1
2020-01-31 07:59:43.466     5
2020-01-31 07:59:43.466    -1
2020-01-31 07:59:43.467    -1
2020-01-31 07:59:43.467    -1
2020-01-31 07:59:43.467     5
2020-01-31 07:59:43.467     1

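I want to add three more columns that show the ratio of negative to positive values for a given number. For example, if the number is 8: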

                        value    neg     pos    total
time
2020-01-31 07:59:43.232    -6
2020-01-31 07:59:43.232    -2      8       0        8
2020-01-31 07:59:43.232    -1
2020-01-31 07:59:43.264     1
2020-01-31 07:59:43.389     0
2020-01-31 07:59:43.466     1
2020-01-31 07:59:43.466     5      1       7        8
2020-01-31 07:59:43.466    -1
2020-01-31 07:59:43.467    -1
2020-01-31 07:59:43.467    -1
2020-01-31 07:59:43.467     5      3       5        8
2020-01-31 07:59:43.467     1

If the number is 5:

                        value    neg     pos    total
time
2020-01-31 07:59:43.232    -6      5       0        5    # take just 5 out of -6 and the rest(-1) is used for the next calculation
2020-01-31 07:59:43.232    -2      
2020-01-31 07:59:43.232    -1
2020-01-31 07:59:43.264     1      4       1        5    # sum(abs(list(-1, -2, -1, 1)))
2020-01-31 07:59:43.389     0
2020-01-31 07:59:43.466     1
2020-01-31 07:59:43.466     5      0       5        5    # 1 + 5 -> take just 5(1, 4) out of them and the rest(1) is used for the next calculation
2020-01-31 07:59:43.466    -1
2020-01-31 07:59:43.467    -1
2020-01-31 07:59:43.467    -1
2020-01-31 07:59:43.467     5      3       4        5    # 1, -1, -1, -1, 5 -> take just 5(1, -1, -1, -1, 1) out of them and the rest(4) is used for the next calculation
2020-01-31 07:59:43.467     1      0       5        5    # 4, 1

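I have been doing the calculation with a loop and several conditional statements, and it is very slow. Is there a faster, more efficient way to do this?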

The code below shows how I do it when the number (GROUP_SIZE) is 300:

GROUP_SIZE = 300

for DATE in lst_requiredDates:

    df = dic_dtf[DATE]

    lst_groups = []
    lst_group = [0,  0,    0,    0]

    for index, row in df.iterrows():
        date        = index
        value      = row['value']
        abs_value  = abs(value)


        if (lst_group[3]+abs_value) < GROUP_SIZE:

            if value < 0:
                lst_group[0] = date
                lst_group[1] += abs_value
                lst_group[3] += abs_value
            else:
                lst_group[0] = date
                lst_group[2] += abs_value
                lst_group[3] += abs_value

        elif (lst_group[3]+abs_value) == GROUP_SIZE:

            if value < 0:
                lst_group[0] = date
                lst_group[1] += abs_value
                lst_group[3] += abs_value
            else:
                lst_group[0] = date
                lst_group[2] += abs_value
                lst_group[3] += abs_value

            lst_groups.append(lst_group)
            lst_group = [0,  0,    0,    0]


        elif (lst_group[3]+abs_value) > GROUP_SIZE:
            int_left = (lst_group[3]+abs_value) - GROUP_SIZE

            if value < 0:
                lst_group[0] = date
                lst_group[1] += (abs_value - int_left)
                lst_group[3] += (abs_value - int_left)

                lst_groups.append(lst_group)
                lst_group = [0,  0,    0,    0]
                lst_group[0] = date
                lst_group[1] += int_left
                lst_group[3] += int_left
            else:
                lst_group[0] = date
                lst_group[2] += (abs_value - int_left)
                lst_group[3] += (abs_value - int_left)

                lst_groups.append(lst_group)
                lst_group = [0,  0,    0,    0]
                lst_group[0] = date
                lst_group[2] += int_left
                lst_group[3] += int_left

fil*_*den 3

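Here is a solution that uses operations on the whole DataFrame at once, which should be much more efficient than a row-by-row loop.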


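I used cumsum() twice: once on the absolute values, to find where each group size is reached, and once on the values themselves, which we can use later to find neg and pos.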
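
As a small illustration of those two cumulative sums, here is a sketch on the sample values from the question with a group size of 5. The names `values`, `aux` and `group_size` are only illustrative, and the index is a plain 0..11 range rather than the original timestamps.

```python
import pandas as pd

# Sample values from the question, on a plain integer index.
values = pd.Series([-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1], name="value")
group_size = 5

aux = values.to_frame()
# Signed running total: used later to recover pos - neg per group.
aux["cumsum"] = values.cumsum()
# Running total of magnitudes: tells us when a group fills up.
aux["cumabs"] = values.abs().cumsum()
# Number of complete groups reached at or before each row.
aux["group"] = aux["cumabs"] // group_size

print(aux)
```

The `group` column increases on exactly the rows where a group of size 5 is completed.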


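A single call to shift() then finds the group boundaries: the rows we want to update, which also hold all the data we need to calculate the sums.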
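
The boundary test can be sketched in isolation as follows, again on the sample values with a group size of 5. The variable names are illustrative only.

```python
import pandas as pd

values = pd.Series([-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1], name="value")
group_size = 5

group = values.abs().cumsum() // group_size
# A row is a boundary when its group number differs from the previous row's.
# fill_value=0 makes the first row compare against "no groups completed yet".
is_boundary = group != group.shift(fill_value=0)
boundary_rows = values[is_boundary]

print(boundary_rows)
```

For this data the boundaries fall on rows 0, 3, 6, 10 and 11, which are the rows that receive neg, pos and total values.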


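Handling the remainder is not too hard: take the cumulative sum of absolute values modulo the group size, and give it the sign of the last value in the group.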
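
A minimal sketch of that remainder step, assuming the same sample values and a group size of 5 (names are illustrative):

```python
import numpy as np
import pandas as pd

values = pd.Series([-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1], name="value")
group_size = 5

cumabs = values.abs().cumsum()
group = cumabs // group_size
boundary = group != group.shift(fill_value=0)

# The overshoot past the group size belongs to the value on the boundary
# row, so it inherits that value's sign.
remainder = (cumabs[boundary] % group_size) * np.sign(values[boundary])

print(remainder)
```

On row 0, for example, the value -6 fills a group of 5 and leaves a remainder of -1 that carries into the next group.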


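This is where the cumulative sum of the signed values comes in handy. After adjusting it for the remainder, we can find pos - neg for the current group as the difference from the previous boundary row.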


Knowing that pos and neg add up to the group size, we can easily calculate the two individual values and add them to the DataFrame.
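
The algebra is just two equations in two unknowns: pos + neg = group_size and pos - neg = group_sum, so pos = (group_size + group_sum) / 2 and neg = (group_size - group_sum) / 2. The helper below, `split_pos_neg`, is a hypothetical name used only to illustrate this; it assumes integer inputs, for which both numerators are always even.

```python
def split_pos_neg(group_sum, group_size):
    """Recover (neg, pos) from their signed sum and their total magnitude."""
    neg = (group_size - group_sum) // 2
    pos = (group_size + group_sum) // 2
    return neg, pos


# Per-group signed sums for the sample data with a group size of 5.
for group_sum in [-5, -3, 5, -1, 5]:
    neg, pos = split_pos_neg(group_sum, 5)
    print(group_sum, neg, pos)
```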


The code follows, with comments that explain all of this further:

import pandas as pd
import numpy as np

def get_pos_neg_ratio(series, group_size):
    df = series.rename('value').to_frame()

    # Calculate the cumulative sum and the cumulative sum
    # of absolute values. The latter will be used to break
    # the series into groups.
    df_aux = df.copy()
    df_aux['cumsum'] = df['value'].cumsum()
    df_aux['cumabs'] = abs(df['value']).cumsum()
    df_aux['group'] = df_aux['cumabs'] // group_size

    # Break it into groups, by locating the boundaries.
    df_aux = df_aux[
        df_aux['group'] != df_aux['group'].shift(fill_value=0)
    ].copy()

    # Calculate the remainder on each boundary row. Give
    # it the sign of the value in that row, since that
    # value is the one that got it over the group size.
    df_aux['remainder'] = (
        (df_aux['cumabs'] % group_size) *
        np.sign(df_aux['value'])
    )

    # Adjust the sums by the remainder.
    df_aux['adjsum'] = df_aux['cumsum'] - df_aux['remainder']

    # Finally, find the individual sums by subtracting
    # the adjusted cumulative sum of the previous group.
    # This will be the total sum of positives and
    # negatives for this group.
    df_aux['grpsum'] = (
        df_aux['adjsum'] -
        df_aux['adjsum'].shift(fill_value=0)
    )

    # Now we can calculate positives and negatives. We
    # know that their absolute values sum up to group_size
    # and that they sum up to `grpsum`, so a little bit of
    # algebra will get us to:
    df['neg'] = (group_size - df_aux['grpsum']) // 2
    df['pos'] = (group_size + df_aux['grpsum']) // 2
    df['total'] = df['neg'] + df['pos']
    return df

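Pass the function a Series (i.e., one column) and a group size, and it returns a DataFrame containing that column (under the name value) along with the computed neg, pos and total columns.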


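Also note that this function requires an index without duplicates! Otherwise the final assignment will fail. I suggest you call reset_index() first, which turns time into a regular column, and then possibly set_index() back afterwards.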
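
The index round trip might look like the sketch below. It uses only the first three rows of the sample data, and the step in the middle is left as a comment because it depends on the function defined above.

```python
import pandas as pd

# Three rows from the question; the first two share a timestamp.
times = pd.to_datetime([
    "2020-01-31 07:59:43.232",
    "2020-01-31 07:59:43.232",
    "2020-01-31 07:59:43.264",
])
df = pd.DataFrame({"value": [-6, -2, 1]}, index=pd.Index(times, name="time"))
print(df.index.is_unique)    # the duplicated timestamp makes this False

# Move `time` into a regular column; the new index is a unique 0..n-1 range.
flat = df.reset_index()
print(flat.index.is_unique)

# ... compute neg/pos/total on flat["value"] here and add them to `flat` ...

# Restore the original time index afterwards.
restored = flat.set_index("time")
```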


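This code breaks if a single value takes us past two group boundaries at once. For the sample data, that happens with group_size ≤ 4. It could probably be fixed for that case (we can detect that a group was skipped at a boundary), but it is unclear how those cases should be handled: should we insert a new row with a NaN value and a duplicated index for the extra group?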
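
Detecting the problematic rows is straightforward, even if handling them is not. The function below is a hypothetical helper, not part of the solution above; it flags rows where the group number jumps by more than one.

```python
import pandas as pd


def rows_crossing_multiple_groups(values, group_size):
    """Return the rows whose value completes more than one group at once."""
    group = values.abs().cumsum() // group_size
    # How many groups each row completes; the first row compares against 0.
    jump = group - group.shift(fill_value=0)
    return values[jump > 1]


values = pd.Series([-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1], name="value")

print(rows_crossing_multiple_groups(values, 4))   # rows 6 and 10
print(rows_crossing_multiple_groups(values, 5))   # empty: safe to use
```

One option is to raise an error when this result is non-empty, so that the vectorized function is never applied to data it cannot handle.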


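Since you did not mention this case in the examples you provided, and your sample code uses a large group size of 300, I assume this is most likely not something you are very concerned about, and the current approach is adequate.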
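
If that case ever does matter, a plain loop is a simple fallback and also a handy way to cross-check the vectorized result on small inputs. The sketch below is not the asker's original loop; `group_ratios_loop` is a hypothetical reference implementation that reports one (row position, neg, pos) tuple per completed group, so a single row can appear more than once.

```python
def group_ratios_loop(values, group_size):
    """Return (row_position, neg, pos) for every completed group."""
    results = []
    neg = pos = 0
    for i, v in enumerate(values):
        remaining = abs(v)
        while remaining > 0:
            # Take only as much as still fits into the current group.
            take = min(remaining, group_size - (neg + pos))
            if v < 0:
                neg += take
            else:
                pos += take
            remaining -= take
            if neg + pos == group_size:
                results.append((i, neg, pos))
                neg = pos = 0
    return results


sample = [-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1]
print(group_ratios_loop(sample, 5))
print(group_ratios_loop(sample, 8))
```

It is slow for large inputs, which is the original complaint, so it is best kept for validation or for the rare rows that the vectorized version cannot handle.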


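Another point to consider is that we do not keep the final sum of the last, incomplete group. If we were streaming data or concatenating DataFrames, we would have no way of knowing how much was left over to carry into the next calculation.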
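
The size of that leftover is at least easy to report separately. The helper below is a hypothetical addition that returns only the total magnitude of the unfinished group; splitting it into neg and pos would need more bookkeeping than is shown here.

```python
import pandas as pd


def leftover_total(values, group_size):
    """Magnitude accumulated in the last, still incomplete group."""
    return int(values.abs().sum() % group_size)


values = pd.Series([-6, -2, -1, 1, 0, 1, 5, -1, -1, -1, 5, 1], name="value")

print(leftover_total(values, 8))   # the trailing 1 in the last row
print(leftover_total(values, 5))   # nothing left over
```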


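Again, since in your example with a group size of 8 you do not seem to mention the remainder of 1 in the last row, I believe this is not really a concern for you either.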


A sample run on your data (after resetting the index), with a group size of 5:

>>> df = df.reset_index()
>>> print(get_pos_neg_ratio(df['value'], 5))
    value  neg  pos  total
0      -6  5.0  0.0    5.0
1      -2  NaN  NaN    NaN
2      -1  NaN  NaN    NaN
3       1  4.0  1.0    5.0
4       0  NaN  NaN    NaN
5       1  NaN  NaN    NaN
6       5  0.0  5.0    5.0
7      -1  NaN  NaN    NaN
8      -1  NaN  NaN    NaN
9      -1  NaN  NaN    NaN
10      5  3.0  2.0    5.0
11      1  0.0  5.0    5.0

(In the question, you listed pos for row 10 as 4, but it should actually be 2.)


With a group size of 8:

>>> print(get_pos_neg_ratio(df['value'], 8))
    value  neg  pos  total
0      -6  NaN  NaN    NaN
1      -2  8.0  0.0    8.0
2      -1  NaN  NaN    NaN
3       1  NaN  NaN    NaN
4       0  NaN  NaN    NaN
5       1  NaN  NaN    NaN
6       5  1.0  7.0    8.0
7      -1  NaN  NaN    NaN
8      -1  NaN  NaN    NaN
9      -1  NaN  NaN    NaN
10      5  3.0  5.0    8.0
11      1  NaN  NaN    NaN

With a group size of 7:

>>> print(get_pos_neg_ratio(df['value'], 7))
    value  neg  pos  total
0      -6  NaN  NaN    NaN
1      -2  7.0  0.0    7.0
2      -1  NaN  NaN    NaN
3       1  NaN  NaN    NaN
4       0  NaN  NaN    NaN
5       1  NaN  NaN    NaN
6       5  2.0  5.0    7.0
7      -1  NaN  NaN    NaN
8      -1  NaN  NaN    NaN
9      -1  NaN  NaN    NaN
10      5  3.0  4.0    7.0
11      1  NaN  NaN    NaN