Sam*_*old 7 python time-series distinct-values pandas rolling-computation
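I have a time series of people visiting a building. Each person has a unique ID. For every record in the time series, I want to know the number of unique people who visited the building in the last 365 days (i.e. a rolling unique count with a window of 365 days).
pandas does not seem to have a built-in method for this calculation. The calculation becomes computationally intensive when there are many unique visitors and/or the window is large. (The actual data is larger than this example.)
Is there a better way to calculate this than what I've done below? I'm not sure why the fast method I wrote, windowed_nunique (under "Speed test 3"), is off by 1.
Thanks for any help!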
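For reference, the quantity being asked for can be pinned down with a brute-force sketch in plain Python. It assumes that the window for a record at date d covers the records up to and including that record whose date falls in the half-open interval (d - window, d], which should mirror pandas' default closed='right' behaviour for offset windows. The name rolling_nunique_bruteforce is hypothetical, and the loop is quadratic, so it is only meant as a correctness check on small inputs.

```python
from datetime import date, timedelta


def rolling_nunique_bruteforce(dates, pids, window_days):
    """Count distinct pids per record over the trailing window (d - window, d].

    Assumes `dates` is sorted ascending and aligned with `pids`.
    """
    counts = []
    for i, d in enumerate(dates):
        start = d - timedelta(days=window_days)
        # Only records up to and including record i are candidates.
        seen = {p for dj, p in zip(dates[: i + 1], pids[: i + 1]) if dj > start}
        counts.append(len(seen))
    return counts


visits = [date(2010, 1, 1), date(2010, 1, 1), date(2010, 1, 2),
          date(2010, 1, 4), date(2010, 1, 5)]
people = [1, 2, 1, 3, 3]
print(rolling_nunique_bruteforce(visits, people, window_days=3))  # [1, 2, 2, 2, 1]
```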
Related links:
pandas issue: https://github.com/pandas-dev/pandas/issues/14336
In [1]:
# Import libraries.
import pandas as pd
import numba
import numpy as np
In [2]:
# Create data of people visiting a building.
np.random.seed(seed=0)
dates = pd.date_range(start='2010-01-01', end='2015-01-01', freq='D')
window = 365 # days
num_pids = 100
probs = np.linspace(start=0.001, stop=0.1, num=num_pids)
df = pd\
    .DataFrame(
        data=[(date, pid)
              for (pid, prob) in zip(range(num_pids), probs)
              for date in np.compress(np.random.binomial(n=1, p=prob, size=len(dates)), dates)],
        columns=['Date', 'PersonId'])\
    .sort_values(by='Date')\
    .reset_index(drop=True)
print("Created data of people visiting a building:")
df.head() # 9181 rows × 2 columns
Out[2]:
Created data of people visiting a building:
| | Date | PersonId |
|---|------------|----------|
| 0 | 2010-01-01 | 76 |
| 1 | 2010-01-01 | 63 |
| 2 | 2010-01-01 | 89 |
| 3 | 2010-01-01 | 81 |
| 4 | 2010-01-01 | 7 |
In [3]:
%%timeit
# This counts the number of people visiting the building, not the number of unique people.
# Provided as a speed reference.
df.rolling(window='{:d}D'.format(window), on='Date').count()
3.32 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]:
%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())
2.42 s ± 282 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]:
# Save results as a reference to check calculation accuracy.
ref = df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())['PersonId'].values
In [6]:
# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def nunique(arr):
    return len(set(arr))
In [7]:
%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)
430 ms ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]:
# Check accuracy of results.
test = df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)['PersonId'].values
assert all(ref == test)
In [9]:
# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def windowed_nunique(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
        window (int): Width of window in units of difference of `dates`.

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if `len(dates) != len(pids)`

    Notes:
        * May be off by 1 compared to `pandas.core.window.Rolling`
            with a time series alias offset.

    """

    # Check arguments.
    assert dates.shape == pids.shape

    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids)
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1

    # For each (date, person)...
    while idx < idx_max:

        # If person count went from 0 to 1, increment unique person count.
        date = dates[idx]
        pid = pids[idx]
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1

        # For past dates outside of window...
        while (date - date_min) > window:

            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]

        # Record unique person count.
        ucts[idx] = uct
        idx += 1

    return ucts
In [10]:
# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)
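As a sanity check on the integer conversion, the same day count can be reproduced with the standard library alone. This is only a sketch; days_since_epoch is a hypothetical helper and not part of pandas.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def days_since_epoch(d):
    """Whole days between the Unix epoch and calendar date `d`."""
    return (d - EPOCH).days


# 2010-01-19 should map to 14628, the DateEpoch value shown for that date in Out[13].
print(days_since_epoch(date(2010, 1, 19)))  # 14628
```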
In [11]:
%%timeit
windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
107 µs ± 63.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]:
# Check accuracy of results.
test = windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
# Note: Method may be off by 1.
assert all(np.isclose(ref, np.asarray(test), atol=1))
In [13]:
# Show where the calculation doesn't match.
print("Where reference ('ref') calculation of number of unique people doesn't match 'test':")
df['ref'] = ref
df['test'] = test
df.loc[df['ref'] != df['test']].head() # 9044 rows × 5 columns
Out[13]:
Where reference ('ref') calculation of number of unique people doesn't match 'test':
| | Date | PersonId | DateEpoch | ref | test |
|----|------------|----------|-----------|------|------|
| 78 | 2010-01-19 | 99 | 14628 | 56.0 | 55 |
| 79 | 2010-01-19 | 96 | 14628 | 56.0 | 55 |
| 80 | 2010-01-19 | 88 | 14628 | 56.0 | 55 |
| 81 | 2010-01-20 | 94 | 14629 | 56.0 | 55 |
| 82 | 2010-01-20 | 48 | 14629 | 57.0 | 56 |
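I had two errors in the fast method windowed_nunique, both corrected below in windowed_nunique_corrected.
First, the array that counts visits per person within the window, pid_cts, was one element too small; it is now sized np.max(pids) + 1.
Second, the window includes both of its end days (a window of 3 covers day0, day1, day2), so date_min should be updated when (date - date_min + 1) > window.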
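The two fixes can be illustrated on a toy input with a plain-Python sketch. This is not the numba implementation: sliding_nunique and counter_array_size are hypothetical names, a dict stands in for the counter array, and window >= 1 is assumed. As a side note, numba's nopython mode does not bounds-check array indexing by default (worth verifying for your numba version), which would explain why the undersized array produced wrong counts rather than an error.

```python
def counter_array_size(pids, corrected):
    """Length of a per-person counter array that is indexed directly by pid."""
    # Indexing by pid needs slots 0..max(pids), i.e. max(pids) + 1 entries.
    return max(pids) + 1 if corrected else max(pids)


def sliding_nunique(days, pids, window, corrected):
    """One-pass unique count over sorted integer days (assumes window >= 1)."""
    counts = {}  # pid -> number of visits currently inside the window
    out = []
    lo = 0       # index of the oldest record still inside the window
    for day, pid in zip(days, pids):
        counts[pid] = counts.get(pid, 0) + 1
        # Corrected test: evict while (day - oldest + 1) > window.
        # Original test:  evict while (day - oldest) > window.
        span_offset = 1 if corrected else 0
        while day - days[lo] + span_offset > window:
            old = pids[lo]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
            lo += 1
        out.append(len(counts))
    return out


days = [0, 1, 2, 3]
pids = [10, 11, 12, 13]
print(sliding_nunique(days, pids, window=3, corrected=False))  # [1, 2, 3, 4]
print(sliding_nunique(days, pids, window=3, corrected=True))   # [1, 2, 3, 3]
print(counter_array_size([0, 5, 99], corrected=False))         # 99, so index 99 is out of range
```

With window=3, the record on day 3 should only see days 1 to 3, so the corrected eviction test reports 3 unique people where the original reports 4.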
In [14]:
# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def windowed_nunique_corrected(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
            Required: min(pids) >= 0
        window (int): Width of window in units of difference of `dates`.
            Required: window >= 1

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if not...
            * len(dates) == len(pids)
            * min(pids) >= 0
            * window >= 1

    Notes:
        * Matches `pandas.core.window.Rolling`
            with a time series alias offset.

    """

    # Check arguments.
    assert len(dates) == len(pids)
    assert np.min(pids) >= 0
    assert window >= 1

    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids) + 1
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1

    # For each (date, person)...
    while idx < idx_max:

        # Lookup date, person.
        date = dates[idx]
        pid = pids[idx]

        # If person count went from 0 to 1, increment unique person count.
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1

        # For past dates outside of window...
        # Note: If window=3, it includes day0,day1,day2.
        while (date - date_min + 1) > window:

            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]

        # Record unique person count.
        ucts[idx] = uct
        idx += 1

    return ucts
In [15]:
# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)
In [16]:
%%timeit
windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
98.8 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [17]:
# Check accuracy of results.
test = windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
assert all(ref == test)
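The corrected function requires non-negative integer IDs and allocates np.max(pids) + 1 counters, so string IDs or sparse integer IDs would need remapping to dense codes first. One possible way to do that in plain Python is sketched below; to_dense_codes is a hypothetical helper, and pandas.factorize offers comparable functionality for a Series.

```python
def to_dense_codes(ids):
    """Map arbitrary hashable IDs to integers 0..k-1 in order of first appearance."""
    codes = {}
    # setdefault assigns the next unused integer the first time an ID is seen.
    return [codes.setdefault(x, len(codes)) for x in ids]


print(to_dense_codes(["bob", "ann", "bob", 1000000]))  # [0, 1, 0, 2]
```

The resulting codes keep the counter array as small as the number of distinct people, regardless of how large or non-numeric the original IDs are.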