Nil*_*age 8 python timedelta dataframe pandas
I'm analyzing Apache log files, which I've imported into a pandas DataFrame.
'65.55.52.118 - - [30/May/2013:06:58:52 -0600] "GET /detailedAddVen.php?refId=7954&uId=2802 HTTP/1.1" 200 4514 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"'
My DataFrame:

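(For context, one way a line like the one above could be parsed into such a DataFrame. This is only a sketch, not the asker's actual import code; the regex, the file name access.log, and the column names IP/Agent/Time are assumptions.)

import re
import pandas as pd

# Combined-log-format fields: IP, identity, user, timestamp, request,
# status, size, referer, user agent. Pattern and names are illustrative.
LOG_RE = re.compile(
    r'(?P<IP>\S+) \S+ \S+ \[(?P<Time>[^\]]+)\] '
    r'"(?P<Request>[^"]*)" (?P<Status>\d+) (?P<Size>\S+) '
    r'"(?P<Referer>[^"]*)" "(?P<Agent>[^"]*)"'
)

with open('access.log') as fh:  # assumed file name
    rows = [m.groupdict() for m in (LOG_RE.match(line) for line in fh) if m]

df = pd.DataFrame(rows)
# Apache timestamps look like 30/May/2013:06:58:52 -0600
df['Time'] = pd.to_datetime(df['Time'], format='%d/%b/%Y:%H:%M:%S %z')
df = df.set_index('Time').sort_index()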
I want to group the rows into sessions based on IP, Agent, and the time difference (whenever the gap is longer than 30 minutes, a new session should start).
Grouping the DataFrame by IP and Agent is easy, but how do I check this time difference? I hope the question is clear.
sessions = df.groupby(['IP', 'Agent']).size()
Update: df.index looks like this:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-05-30 06:00:41, ..., 2013-05-30 22:29:14]
Length: 31975, Freq: None, Timezone: None
And*_*den 12
I would do this with a shift and a cumsum (here's a simple example using numbers rather than times, but they work exactly the same way):
In [11]: s = pd.Series([1., 1.1, 1.2, 2.7, 3.2, 3.8, 3.9])
In [12]: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False) # *
Out[12]:
0 0
1 0
2 0
3 1
4 1
5 2
6 2
dtype: int64
* The need for skipna=False appears to be a bug.
Then you can use this with apply inside a groupby:
In [21]: df = pd.DataFrame([[1.1, 1.7, 2.5, 2.6, 2.7, 3.4], list('AAABBB')]).T
In [22]: df.columns = ['time', 'ip']
In [23]: df
Out[23]:
  time ip
0  1.1  A
1  1.7  A
2  2.5  A
3  2.6  B
4  2.7  B
5  3.4  B
In [24]: g = df.groupby('ip')
In [25]: df['session_number'] = g['time'].apply(lambda s: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False))
In [26]: df
Out[26]:
  time ip session_number
0  1.1  A              0
1  1.7  A              1
2  2.5  A              2
3  2.6  B              0
4  2.7  B              0
5  3.4  B              1
Now you can group by 'ip' and 'session_number' (and analyse each session).
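Applied to the original question's data, the same idea works with real timestamps: a gap larger than 30 minutes starts a new session within each IP/Agent pair. A minimal sketch, assuming df has the DatetimeIndex shown in the question and columns named 'IP' and 'Agent' (the column names are assumptions):

import pandas as pd

timeout = pd.Timedelta(minutes=30)

df = df.sort_index()
by_user = [df['IP'], df['Agent']]

# Gap to the previous hit from the same IP/Agent pair (NaT for the first hit)
gap = df.index.to_series().groupby(by_user).diff()

# A new session starts whenever the gap exceeds the timeout; the per-user
# cumulative sum numbers the sessions within each IP/Agent pair
df['session_number'] = (gap > timeout).astype(int).groupby(by_user).cumsum()

# Per-session hit counts, as suggested above
sessions = df.groupby(['IP', 'Agent', 'session_number']).size()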
Andy Hayden's answer is lovely and concise, but it gets very slow if you have a large number of users/IP addresses to group over. Here's another method that's much uglier, but also much faster.
import pandas as pd
import numpy as np

sample = lambda x: np.random.choice(x, size=10000)
df = pd.DataFrame({'ip': sample(range(500)),
                   'time': sample([1., 1.1, 1.2, 2.7, 3.2, 3.8, 3.9])})
max_diff = 0.5  # Max time difference

def method_1(df):
    df = df.sort_values('time')
    g = df.groupby('ip')
    df['session'] = g['time'].apply(
        lambda s: (s - s.shift(1) > max_diff).fillna(0).cumsum(skipna=False)
    )
    return df['session']


def method_2(df):
    # Sort by ip then time
    df = df.sort_values(['ip', 'time'])

    # Get locations where the ip changes
    ip_change = df.ip != df.ip.shift()
    time_or_ip_change = (df.time - df.time.shift() > max_diff) | ip_change
    df['session'] = time_or_ip_change.cumsum()

    # The cumsum operated over the whole series, so subtract out the first
    # value for each IP
    df['tmp'] = 0
    df.loc[ip_change, 'tmp'] = df.loc[ip_change, 'session']
    df['tmp'] = np.maximum.accumulate(df.tmp)
    df['session'] = df.session - df.tmp

    # Delete the temporary column
    del df['tmp']
    return df['session']

r1 = method_1(df)
r2 = method_2(df)

assert (r1.sort_index() == r2.sort_index()).all()

%timeit method_1(df)
%timeit method_2(df)

400 ms ± 195 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
11.6 ms ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
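The same break-and-cumsum trick carries over to timestamps. A minimal sketch, assuming columns named 'IP' and 'Agent' plus a 'Time' column of Timestamps (the names are assumptions): mark every row that starts a new IP/Agent pair or follows a gap of more than 30 minutes, then take the cumulative sum. This yields globally unique session ids; method_2 above goes one step further and subtracts the per-user offset to turn them into per-user session numbers.

import pandas as pd

timeout = pd.Timedelta(minutes=30)

df = df.sort_values(['IP', 'Agent', 'Time'])

# A row starts a new session if the IP/Agent pair changes or the gap
# to the previous row exceeds the timeout (the first row always qualifies)
user_change = (df['IP'] != df['IP'].shift()) | (df['Agent'] != df['Agent'].shift())
long_gap = (df['Time'] - df['Time'].shift()) > timeout

# Globally unique session id across all users
df['session_id'] = (user_change | long_gap).cumsum()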