Custom rolling_apply function in Python pandas

Max*_*sky 3 python group-by aggregate data-analysis pandas

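Setup

I have a DataFrame with three columns:

"Category" contains True and False, and I have called df.groupby('Category') to group by these values.
"Time" contains the timestamps (in seconds) at which the values were recorded.
"Value" contains the values themselves.

At each time instance, two values are recorded: one with category "True" and the other with category "False".

Rolling apply problem

Within each category group, I want to compute a number and store it in a Result column for every time. The result is the percentage of values recorded between time t-60 and t that lie between 1 and 3.

The simplest way to accomplish this is probably to count the total number of values in that time interval with rolling_count, and then run rolling_apply to count only the values in that interval that lie between 1 and 3.

Here is my code so far: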

groups = df.groupby(['Category'])
for key, grp in groups:
    grp = grp.reindex(grp['Time']) # reindex by time so we can count with rolling windows
    grp['total'] = pd.rolling_count(grp['Value'], window=60) # count number of values in the last 60 seconds
    grp['in_interval'] = ? ## Need to count number of values where 1<v<3 in the last 60 seconds

    grp['Result'] = grp['in_interval'] / grp['total'] # percentage of values between 1 and 3 in the last 60 seconds


What is the correct rolling_apply() call to compute grp['in_interval']?

Let's work through an example:

import pandas as pd
import numpy as np
np.random.seed(1)

def setup(regular=True):
    N = 10
    x = np.arange(N)
    a = np.arange(N)
    b = np.arange(N)

    if regular:
        timestamps = np.linspace(0, 120, N)
    else:
        timestamps = np.random.uniform(0, 120, N)

    df = pd.DataFrame({
        'Category': [True]*N + [False]*N,
        'Time': np.hstack((timestamps, timestamps)),
        'Value': np.hstack((a,b))
        })
    return df

df = setup(regular=False)
df.sort_values(['Category', 'Time'], inplace=True)  # DataFrame.sort was replaced by sort_values


So the DataFrame df looks like this:

In [4]: df
Out[4]: 
   Category       Time  Value    Result
12    False   0.013725      2  1.000000
15    False  11.080631      5  0.500000
14    False  17.610707      4  0.333333
16    False  22.351225      6  0.250000
13    False  36.279909      3  0.400000
17    False  41.467287      7  0.333333
18    False  47.612097      8  0.285714
10    False  50.042641      0  0.250000
19    False  64.658008      9  0.125000
11    False  86.438939      1  0.333333
2      True   0.013725      2  1.000000
5      True  11.080631      5  0.500000
4      True  17.610707      4  0.333333
6      True  22.351225      6  0.250000
3      True  36.279909      3  0.400000
7      True  41.467287      7  0.333333
8      True  47.612097      8  0.285714
0      True  50.042641      0  0.250000
9      True  64.658008      9  0.125000
1      True  86.438939      1  0.333333


Now, borrowing from @herrfz, let's define

def between(a, b):
    def between_percentage(series):
        return float(len(series[(a <= series) & (series < b)])) / float(len(series))
    return between_percentage


between(1,3) is a function that takes a Series as input and returns the fraction of its elements that lie in the half-open interval [1,3). For example,

In [9]: series = pd.Series([1,2,3,4,5])

In [10]: between(1,3)(series)
Out[10]: 0.4
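
As a side note on the half-open interval, the same fraction can also be written as the mean of a boolean mask. The sketch below is only an illustration and not part of the original answer; the name fraction_between is hypothetical.

```python
import pandas as pd


def fraction_between(series, a, b):
    """Fraction of the elements of `series` in the half-open interval [a, b)."""
    mask = (a <= series) & (series < b)  # True where a <= value < b
    return mask.mean()                   # the mean of booleans is the share of True


series = pd.Series([1, 2, 3, 4, 5])
print(fraction_between(series, 1, 3))  # 1 and 2 qualify, 3 does not
```

The lower bound is included and the upper bound is excluded, so only 1 and 2 are counted here, which agrees with the 0.4 shown above.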


Now we take our DataFrame df and group it by Category:

df.groupby(['Category'])


For each group in the groupby object, we want to apply a function:

df['Result'] = df.groupby(['Category']).apply(toeach_category)


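The function toeach_category takes a (sub-)DataFrame as input and returns a DataFrame as output. The whole result is assigned to a new column of df called Result.

Now what exactly must toeach_category do? If we write toeach_category like this: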

def toeach_category(subf):
    print(subf)


then we see that each subf is a DataFrame such as this one (for Category equal to False):

   Category       Time  Value    Result
12    False   0.013725      2  1.000000
15    False  11.080631      5  0.500000
14    False  17.610707      4  0.333333
16    False  22.351225      6  0.250000
13    False  36.279909      3  0.400000
17    False  41.467287      7  0.333333
18    False  47.612097      8  0.285714
10    False  50.042641      0  0.250000
19    False  64.658008      9  0.125000
11    False  86.438939      1  0.333333


We want to take the Time column and apply a function to each time. That is done with applymap:

def toeach_category(subf):
    result = subf[['Time']].applymap(percentage)


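The function percentage takes a time value as input and returns a value as output. That value will be the fraction of rows whose Value lies between 1 and 3. applymap is very strict: percentage cannot take any additional arguments.

Given a time t, we can select the Values from subf whose Time lies in the half-open interval (t-60, t] using loc with a boolean mask:

subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value']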


So we can find the fraction of those Values between 1 and 3 by applying between(1,3):

between(1,3)(subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
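
To make the window selection concrete, here is a minimal sketch on a made-up four-row group. The data and the variable names in_window, window_values and in_range are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical group with four rows; Time is in seconds.
subf = pd.DataFrame({"Time": [0.0, 10.0, 70.0, 75.0], "Value": [2, 5, 1, 4]})
t = 75.0

in_window = (t - 60 < subf["Time"]) & (subf["Time"] <= t)  # rows with Time in (15, 75]
window_values = subf.loc[in_window, "Value"]               # the Values 1 and 4
in_range = (1 <= window_values) & (window_values < 3)      # only the Value 1 lies in [1, 3)
fraction = in_range.mean()

print(window_values.tolist(), fraction)
```

Two rows fall inside the 60-second window and one of them has a Value in [1, 3), so the fraction for t = 75 is one half.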


Now remember that we need a function percentage that takes t as input and returns the expression above as output:

def percentage(t):
    return between(1,3)(subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])


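Notice, however, that percentage depends on subf, and we are not allowed to pass subf to percentage as an argument (again, because applymap is very strict).

So how do we get around this obstacle? The solution is to define percentage inside toeach_category. Python's scoping rules say that a bare name such as subf is looked up first in the Local scope, then in the Enclosing scope, then in the Global scope, and finally in the Builtin scope. When percentage(t) is called and Python encounters subf, it first looks in the Local scope for the value of subf. Since subf is not a local variable of percentage, Python looks for it in the Enclosing scope, the function toeach_category. There it finds subf. Perfect. That is exactly what we need.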
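
The lookup rule can be seen in isolation with a small sketch in plain Python, without pandas. The names make_window_counter and count_recent are hypothetical and only mirror the roles of toeach_category and percentage.

```python
def make_window_counter(times):
    # `times` is a parameter of the outer function, so it lives in the
    # Enclosing scope of the inner function defined below.
    def count_recent(t):
        # `times` is not local to count_recent; Python resolves the name in
        # the enclosing function, so it never has to be passed as an argument.
        return sum(1 for x in times if t - 60 < x <= t)
    return count_recent


count_recent = make_window_counter([0, 10, 70, 75])
print(count_recent(75))  # counts the times in (15, 75]
```

The inner function takes a single argument, just as applymap requires, yet it still sees the data held by the outer function.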

So now we have our function toeach_category:

def toeach_category(subf):
    def percentage(t):
        return between(1, 3)(
            subf.loc[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
    result = subf[['Time']].applymap(percentage)
    return result


Putting it all together,

import pandas as pd
import numpy as np
np.random.seed(1)


def setup(regular=True):
    N = 10
    x = np.arange(N)
    a = np.arange(N)
    b = np.arange(N)

    if regular:
        timestamps = np.linspace(0, 120, N)
    else:
        timestamps = np.random.uniform(0, 120, N)

    df = pd.DataFrame({
        'Category': [True] * N + [False] * N,
        'Time': np.hstack((timestamps, timestamps)),
        'Value': np.hstack((a, b))
    })
    return df


def between(a, b):
    def between_percentage(series):
        return float(len(series[(a <= series) & (series < b)])) / float(len(series))
    return between_percentage


def toeach_category(subf):
    def percentage(t):
        return between(1, 3)(
            subf.loc[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
    result = subf[['Time']].applymap(percentage)
    return result


df = setup(regular=False)
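df.sort_values(['Category', 'Time'], inplace=True)  # DataFrame.sort was replaced by sort_values
# group_keys=False keeps the original row index so the result aligns with df
df['Result'] = df.groupby(['Category'], group_keys=False).apply(toeach_category)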
print(df)


yields

   Category       Time  Value    Result
12    False   0.013725      2  1.000000
15    False  11.080631      5  0.500000
14    False  17.610707      4  0.333333
16    False  22.351225      6  0.250000
13    False  36.279909      3  0.200000
17    False  41.467287      7  0.166667
18    False  47.612097      8  0.142857
10    False  50.042641      0  0.125000
19    False  64.658008      9  0.000000
11    False  86.438939      1  0.166667
2      True   0.013725      2  1.000000
5      True  11.080631      5  0.500000
4      True  17.610707      4  0.333333
6      True  22.351225      6  0.250000
3      True  36.279909      3  0.200000
7      True  41.467287      7  0.166667
8      True  47.612097      8  0.142857
0      True  50.042641      0  0.125000
9      True  64.658008      9  0.000000
1      True  86.438939      1  0.166667
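
Since the question asked for a rolling-window formulation, here is a hedged sketch of one way it might look with a time-based window. This is an alternative under stated assumptions, not the method of the answer above: the helper name rolling_fraction and the temporary column hit are hypothetical, Time is assumed to be in seconds with no repeated timestamps inside a category, and the installed pandas is assumed to support offset windows such as "60s" on a datetime-like index.

```python
import pandas as pd


def rolling_fraction(df, low=1, high=3, window="60s"):
    """Per Category, fraction of Values in [low, high) over the trailing window."""
    out = df.sort_values(["Category", "Time"]).copy()
    # 1.0 where the Value lies in [low, high), else 0.0; the rolling mean of
    # this indicator is the fraction we are after.
    out["hit"] = ((low <= out["Value"]) & (out["Value"] < high)).astype(float)

    pieces = []
    for _, grp in out.groupby("Category"):
        # Offset windows need a datetime-like index, so seconds become timestamps.
        hits = pd.Series(
            grp["hit"].to_numpy(),
            index=pd.to_datetime(grp["Time"].to_numpy(), unit="s"),
        )
        # By default an offset window is closed on the right: (t - 60s, t].
        frac = hits.rolling(window).mean()
        # Put the original row labels back so the pieces align with `out`.
        pieces.append(pd.Series(frac.to_numpy(), index=grp.index))

    out["Result"] = pd.concat(pieces)
    return out.drop(columns="hit")


# Hypothetical demo data: the same four readings in both categories.
demo = pd.DataFrame({
    "Category": [True] * 4 + [False] * 4,
    "Time": [0.0, 10.0, 70.0, 75.0] * 2,
    "Value": [2, 5, 1, 4] * 2,
})
print(rolling_fraction(demo))
```

The window bounds follow the same (t-60, t] convention as the mask in percentage, so on data without repeated timestamps this should agree with the applymap approach while avoiding a full scan of the group for every row.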


