How can I efficiently compute a rolling unique count in a pandas time series?

Sam*_*old 7 python time-series distinct-values pandas rolling-computation

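I have a time series of people visiting a building. Each person has a unique ID. For every record in the time series, I want to know the number of unique people who visited the building in the past 365 days (i.e. a rolling unique count with a 365-day window).

pandas does not appear to have a built-in method for this calculation. The calculation becomes computationally intensive when there are many unique visitors and/or a large window. (The actual data is larger than this example.)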
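To make the target quantity concrete, here is a minimal brute-force sketch on made-up toy data. The frame `visits` and the helper `rolling_nunique_bruteforce` are hypothetical names used only for illustration, and the sketch assumes the window covers the interval (date - window, date] and only looks at rows up to and including the current one.

```python
import pandas as pd

# Hypothetical toy data: five visits by three people over six days.
visits = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2010-01-01", "2010-01-02", "2010-01-02", "2010-01-04", "2010-01-06"]),
    "PersonId": [1, 2, 1, 3, 2],
})

def rolling_nunique_bruteforce(frame, window_days):
    """Count distinct PersonId values in (date - window, date] for each row."""
    width = pd.Timedelta(days=window_days)
    counts = []
    for i, date in enumerate(frame["Date"]):
        # Only rows up to and including the current row are candidates.
        earlier = frame.iloc[: i + 1]
        # Keep the candidates whose date is strictly after the window start.
        in_window = earlier["Date"] > date - width
        counts.append(int(earlier.loc[in_window, "PersonId"].nunique()))
    return counts

print(rolling_nunique_bruteforce(visits, window_days=3))  # prints [1, 2, 2, 3, 2]
```

This rescans earlier rows for every record, so it is only useful as a readable definition and as a check for faster versions.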

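Is there a better way to calculate this than what I've done below? I'm not sure why the fast method I made, windowed_nunique (under "Speed test 3"), is off by 1.

Thanks for any help!

Related links:

Source Jupyter Notebook: https://gist.github.com/stharrold/17589e6809d249942debe3a5c43d38cc
Related pandas issue: https://github.com/pandas-dev/pandas/issues/14336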

Initialization

In [1]:

# Import libraries.
import pandas as pd
import numba
import numpy as np


In [2]:

# Create data of people visiting a building.

np.random.seed(seed=0)
dates = pd.date_range(start='2010-01-01', end='2015-01-01', freq='D')
window = 365 # days
num_pids = 100
probs = np.linspace(start=0.001, stop=0.1, num=num_pids)

df = pd\
    .DataFrame(
        data=[(date, pid)
              for (pid, prob) in zip(range(num_pids), probs)
              for date in np.compress(np.random.binomial(n=1, p=prob, size=len(dates)), dates)],
        columns=['Date', 'PersonId'])\
    .sort_values(by='Date')\
    .reset_index(drop=True)

print("Created data of people visiting a building:")
df.head() # 9181 rows × 2 columns


Out[2]:

Created data of people visiting a building:

|   | Date       | PersonId | 
|---|------------|----------| 
| 0 | 2010-01-01 | 76       | 
| 1 | 2010-01-01 | 63       | 
| 2 | 2010-01-01 | 89       | 
| 3 | 2010-01-01 | 81       | 
| 4 | 2010-01-01 | 7        |


Speed reference

In [3]:

%%timeit
# This counts the number of people visiting the building, not the number of unique people.
# Provided as a speed reference.
df.rolling(window='{:d}D'.format(window), on='Date').count()


3.32 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Speed test 1

In [4]:

%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())


2.42 s ± 282 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]:

# Save results as a reference to check calculation accuracy.
ref = df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())['PersonId'].values


Speed test 2

In [6]:

# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def nunique(arr):
    return len(set(arr))


In [7]:

%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)


430 ms ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]:

# Check accuracy of results.
test = df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)['PersonId'].values
assert all(ref == test)


Speed test 3

In [9]:

# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def windowed_nunique(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
        window (int): Width of window in units of difference of `dates`.

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if `len(dates) != len(pids)`

    Notes:
        * May be off by 1 compared to `pandas.core.window.Rolling`
            with a time series alias offset.

    """

    # Check arguments.
    assert dates.shape == pids.shape

    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids)
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1

    # For each (date, person)...
    while idx < idx_max:

        # If person count went from 0 to 1, increment unique person count.
        date = dates[idx]
        pid = pids[idx]
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1

        # For past dates outside of window...
        while (date - date_min) > window:

            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]

        # Record unique person count.
        ucts[idx] = uct
        idx += 1

    return ucts


In [10]:

# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)


In [11]:

%%timeit
windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)


107 µs ± 63.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]:

# Check accuracy of results.
test = windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
# Note: Method may be off by 1.
assert all(np.isclose(ref, np.asarray(test), atol=1))


In [13]:

# Show where the calculation doesn't match.
print("Where reference ('ref') calculation of number of unique people doesn't match 'test':")
df['ref'] = ref
df['test'] = test
df.loc[df['ref'] != df['test']].head() # 9044 rows × 5 columns


Out[13]:

Where reference ('ref') calculation of number of unique people doesn't match 'test':

|    | Date       | PersonId | DateEpoch | ref  | test | 
|----|------------|----------|-----------|------|------| 
| 78 | 2010-01-19 | 99       | 14628     | 56.0 | 55   | 
| 79 | 2010-01-19 | 96       | 14628     | 56.0 | 55   | 
| 80 | 2010-01-19 | 88       | 14628     | 56.0 | 55   | 
| 81 | 2010-01-20 | 94       | 14629     | 56.0 | 55   | 
| 82 | 2010-01-20 | 48       | 14629     | 57.0 | 56   |


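I had two errors in the fast method windowed_nunique, which are now corrected in windowed_nunique_corrected below:

The array pid_cts, which keeps the count of each person ID within the window, was too small.
Because the leading and trailing edges of the window include whole days, date_min should be updated when (date - date_min + 1) > window.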
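Both fixes can be illustrated in isolation with a minimal sketch in plain NumPy and Python. The values below are made up for illustration. Plain NumPy raises an IndexError for the undersized array, whereas Numba's nopython mode does not check array bounds by default, which would explain why the original function ran without complaint.

```python
import numpy as np

# Fix 1: an array indexed by person ID needs max(pids) + 1 slots.
pids = np.array([0, 3, 1, 3])
too_small = np.zeros(np.max(pids), dtype=np.int64)          # valid indices 0..2
large_enough = np.zeros(np.max(pids) + 1, dtype=np.int64)   # valid indices 0..3
try:
    too_small[np.max(pids)] += 1
    overflowed = False
except IndexError:
    overflowed = True
large_enough[np.max(pids)] += 1

# Fix 2: with integer days, a window of 3 ending on day 10 covers days 8, 9, 10.
window = 3
date = 10
# Days that are NOT evicted under each eviction condition.
kept_original = [d for d in range(5, 11) if not (date - d) > window]
kept_corrected = [d for d in range(5, 11) if not (date - d + 1) > window]

print(overflowed)      # prints True
print(kept_original)   # prints [7, 8, 9, 10]  (four days, one too many)
print(kept_corrected)  # prints [8, 9, 10]
```

The extra day kept by the original condition is consistent with the reference count being computed over a window one day shorter than the one the first version used.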

Related links:

Source Jupyter Notebook updated with the solution: https://gist.github.com/stharrold/17589e6809d249942debe3a5c43d38cc

In [14]:

# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def windowed_nunique_corrected(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
            Required: min(pids) >= 0
        window (int): Width of window in units of difference of `dates`.
            Required: window >= 1

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if not...
            * len(dates) == len(pids)
            * min(pids) >= 0
            * window >= 1

    Notes:
        * Matches `pandas.core.window.Rolling`
            with a time series alias offset.

    """

    # Check arguments.
    assert len(dates) == len(pids)
    assert np.min(pids) >= 0
    assert window >= 1

    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids) + 1
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1

    # For each (date, person)...
    while idx < idx_max:

        # Lookup date, person.
        date = dates[idx]
        pid = pids[idx]

        # If person count went from 0 to 1, increment unique person count.
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1

        # For past dates outside of window...
        # Note: If window=3, it includes day0,day1,day2.
        while (date - date_min + 1) > window:

            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]

        # Record unique person count.
        ucts[idx] = uct
        idx += 1

    return ucts

In [15]:

# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)

In [16]:

%%timeit
windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)

98.8 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [17]:

# Check accuracy of results.
test = windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
assert all(ref == test)
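For readers who want to try the same sliding-window idea without Numba, or with identifiers that are not small non-negative integers, the following is a minimal pure-Python sketch. It is not the code from the notebook: the function name `rolling_nunique` and the toy inputs are hypothetical, and it swaps the fixed-size count array for a `collections.Counter`. It assumes `days` is sorted in non-decreasing order.

```python
from collections import Counter

def rolling_nunique(days, ids, window):
    """Sliding-window distinct count.

    days: non-decreasing integer day numbers.
    ids: hashable identifiers, one per entry in `days`.
    window: number of whole days covered, current day included.
    """
    if len(days) != len(ids):
        raise ValueError("days and ids must have the same length")
    if window < 1:
        raise ValueError("window must be >= 1")
    counts = Counter()  # occurrences of each id currently inside the window
    start = 0           # index of the oldest row still inside the window
    result = []
    for day, ident in zip(days, ids):
        counts[ident] += 1
        # Evict rows that fell out of the window ending on `day`.
        while day - days[start] + 1 > window:
            old = ids[start]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
            start += 1
        # The number of keys left is the number of distinct ids in the window.
        result.append(len(counts))
    return result

days = [0, 1, 1, 3, 5]
ids = ["a", "b", "a", "c", "b"]
print(rolling_nunique(days, ids, window=3))  # prints [1, 2, 2, 3, 2]
```

Each row is added once and evicted at most once, so the work is linear in the number of records. It will be much slower than the compiled version above, but it is convenient for checking results on small inputs.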

Asked:	8 years, 5 months ago
Viewed:	3272 times
Active:	7 years ago
