Mic*_*orn 7 python pandas scikit-learn pandas-groupby
I have a dataset that looks like this:
date area_key total_units timeatend starthour timedifference vps
2020-01-15 08:22:39 0 9603 2020-01-15 16:32:39 8 29400.0 0.32663265306122446
2020-01-13 08:22:07 0 10273 2020-01-13 16:25:08 8 28981.0 0.35447362064801075
2020-01-23 07:16:55 3 5175 2020-01-23 14:32:44 7 26149.0 0.19790431756472524
2020-01-15 07:00:06 1 838 2020-01-15 07:46:29 7 2783.0 0.3011139058569889
2020-01-15 08:16:01 1 5840 2020-01-15 12:41:16 8 15915.0 0.3669494187873076
This is then computed over to create the k-means clusters.
What I want to do is relate these clusters to time windows and the key.
Time windows: 7 am to 10 am, 4 pm to 6 pm, 12 pm to 2 pm, 6 pm to 12 am, 12 am to 7 am, 10 am to 12 pm, 2 pm to 4 pm, and so on.
And, using the key, programmatically show how each cluster differs.
The desired result would be a table similar to the one below, but feel free to develop it in whatever way you think best. By time window I mean that, say, 1 would be before 6 am, 2 - 6 am to 9 am, 3 - 9 to 11, 4 - 11 to 14, and so on. But feel free to change that; it's just my idea.
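The numeric time-window coding described here can be sketched with `pd.cut`; the bin edges below are just one possible reading of the windows above (an assumption, easy to adjust):

```python
import pandas as pd

# Sketch only: map an hour of day to the numeric window codes suggested
# above (1 = before 6 am, 2 = 6-9 am, 3 = 9-11, 4 = 11-14, ...).
# The bin edges are assumptions, not a fixed spec.
def hour_to_window(hour):
    bins = [0, 6, 9, 11, 14, 24]   # assumed window boundaries
    labels = [1, 2, 3, 4, 5]       # window codes
    return int(pd.cut([hour], bins=bins, labels=labels, right=False)[0])

print(hour_to_window(5), hour_to_window(8), hour_to_window(12))
```

The same bins could later be applied to a whole `starthour` column with a single `pd.cut` call instead of a per-value helper.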
I have tried a few approaches using groupby, but it doesn't seem to work well, so I'm hoping for some guidance here.
This data is an example of the individual events:
DateTimeStamp VS_ID VS_Summary_Id Hostname Vehicle_speed Lane Length
11/01/2019 8:22 1 1 place_uno 65 2 71
11/01/2019 8:22 2 1 place_uno 59 1 375
11/01/2019 8:22 3 1 place_uno 59 1 389
11/01/2019 8:22 4 1 place_duo 59 1 832
11/01/2019 8:22 5 1 place_duo 52 1 409
To get volumes, I need to aggregate over time into smaller chunks (15 seconds or 15 minutes; I'll post the code below).
Then it's basically the same idea. Another greedy question: how would I work speed into this measurement? I.e., high volume but low speed should also be catered for.
Thanks again for the amazing help. I'll adapt this to fit the code below, which I forgot to link earlier; it should help make this more concrete and valuable.
Thanks guys!
First data (note: the other sections relate to the updates)
The data is very limited, probably due to the complexity of simplifying it, so I will make some assumptions and write things as generically as possible, so that you can quickly customize them to your needs.
Assumptions:
Clustering is done separately per hour window (configurable via group_divide_set_by_column). Doing so lets you investigate the vehicle clusters of each hour window separately, and learn which cluster areas are more active and need attention.
Notes:
The location used for the clustering is HostName_key, but it is only a dummy so the code can run; it does not necessarily make sense.
Code:
The entry point divides the set by group_divide_set_by_column. This will allow us to divide into groups by 'hour_code', and then cluster by location.
def create_clusters_by_group(df, group_divide_set_by_column='hour_code',
                             clusters_number_list=[2, 3]):
    # Divide set by hours
    divide_df_by_hours(df)
    lst_df_by_groups = {f'{group_divide_set_by_column}_{i}': d
                        for i, (g, d) in enumerate(df.groupby(group_divide_set_by_column))}
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        # Divide into the desired numbers of clusters
        for clusters_number in clusters_number_list:
            create_cluster(group_df, clusters_number)
        # Set column types
        set_colum_types(group_df)
    return lst_df_by_groups
divide_df_by_hours converts starthour into hour codes, similar to the way you phrased it: by time window I mean that, say, 1 would be before 6 am, 2 - 6 am to 9 am, 3 - 9 to 11, 4 - 11 to 14, and so on.
def divide_df_by_hours(df):
    def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
        """
        Divide hours into groups:
            1-5   => 1
            6-8   => 2
            9-11  => 3
            12-14 => 4
            15-17 => 5
            18-20 => 6
            21+   => 7
        """
        if h < start_threshold:
            return 1
        elif h >= end_threshold:
            return end_threshold // windows
        return h // windows
    df['hour_code'] = df['starthour'].apply(lambda h: get_hour_code(h))
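To sanity-check the mapping, the inner helper can be run standalone (the body below is copied from the function above):

```python
# Standalone check of the hour-code mapping from the answer above.
def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
    if h < start_threshold:
        return 1
    elif h >= end_threshold:
        return end_threshold // windows
    return h // windows

# One sample hour from each documented bucket
print([get_hour_code(h) for h in [3, 7, 10, 13, 16, 19, 23]])
```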
set_colum_types is a function that converts the columns to their matching types:
def set_colum_types(df):
    types_dict = {
        'Startdtm': 'datetime64[ns, Australia/Melbourne]',
        'HostName_key': 'category',
        'Totalvehicles': 'int32',
        'Enddtm': 'datetime64[ns, Australia/Melbourne]',
        'starthour': 'int32',
        'timedelta': 'float',
        'vehiclespersec': 'float',
    }
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)
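As a small standalone illustration of this astype-based conversion (the frame and values here are made-up stand-ins, not the real data):

```python
import pandas as pd

# Dummy frame; only two of the real columns are mimicked here.
df = pd.DataFrame({'HostName_key': [0, 1, 1], 'Totalvehicles': ['3', '5', '8']})
for col, col_type in {'HostName_key': 'category', 'Totalvehicles': 'int32'}.items():
    df[col] = df[col].astype(col_type)
print(df.dtypes['HostName_key'].name, df.dtypes['Totalvehicles'].name)
```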
A timeit decorator measures the time of each clustering run, which cuts down on boilerplate. Full code:
import functools
import pandas as pd
from timeit import default_timer as timer
from sklearn.cluster import KMeans


def timeit(func):
    @functools.wraps(func)
    def newfunc(*args, **kwargs):
        startTime = timer()
        func(*args, **kwargs)
        elapsedTime = timer() - startTime
        print('function [{}] finished in {} ms'.format(
            func.__name__, int(elapsedTime * 1000)))
    return newfunc


def set_colum_types(df):
    types_dict = {
        'Startdtm': 'datetime64[ns, Australia/Melbourne]',
        'HostName_key': 'category',
        'Totalvehicles': 'int32',
        'Enddtm': 'datetime64[ns, Australia/Melbourne]',
        'starthour': 'int32',
        'timedelta': 'float',
        'vehiclespersec': 'float',
    }
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)


@timeit
def create_cluster(df, clusters_number):
    # Create K-Means model
    model = KMeans(n_clusters=clusters_number, max_iter=600, random_state=9)
    # Fetch location
    # NOTE: Should be a *real* location, used another column as dummy
    location_df = df[['HostName_key']]
    kmeans = model.fit(location_df)
    # Divide into clusters
    df[f'kmeans_{clusters_number}'] = kmeans.labels_


def divide_df_by_hours(df):
    def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
        """
        Divide hours into groups:
            1-5   => 1
            6-8   => 2
            9-11  => 3
            12-14 => 4
            15-17 => 5
            18-20 => 6
            21+   => 7
        """
        if h < start_threshold:
            return 1
        elif h >= end_threshold:
            return end_threshold // windows
        return h // windows
    df['hour_code'] = df['starthour'].apply(lambda h: get_hour_code(h))


def create_clusters_by_group(df, group_divide_set_by_column='hour_code',
                             clusters_number_list=[2, 3]):
    # Divide set by hours
    divide_df_by_hours(df)
    lst_df_by_groups = {f'{group_divide_set_by_column}_{i}': d
                        for i, (g, d) in enumerate(df.groupby(group_divide_set_by_column))}
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        # Divide into the desired numbers of clusters
        for clusters_number in clusters_number_list:
            create_cluster(group_df, clusters_number)
        # Set column types
        set_colum_types(group_df)
    return lst_df_by_groups


# Load data
df = pd.read_csv('data.csv')
# Print data
print(df)
# Create clusters
lst_df_by_groups = create_clusters_by_group(df)
# For each hour-code dataframe
for group_df_name, group_df in lst_df_by_groups.items():
    print(f'Group {group_df_name} dataframe:')
    print(group_df)
Sample output:
Startdtm HostName_key ... timedelta vehiclespersec
0 2020-01-15 08:22:39 0 ... 29400.0 0.326633
1 2020-01-13 08:22:07 2 ... 28981.0 0.354474
2 2020-01-23 07:16:55 3 ... 26149.0 0.197904
3 2020-01-15 07:00:06 4 ... 2783.0 0.301114
4 2020-01-15 08:16:01 1 ... 15915.0 0.366949
5 2020-01-16 08:22:39 2 ... 29400.0 0.326633
6 2020-01-14 08:22:07 2 ... 28981.0 0.354479
7 2020-01-25 07:16:55 4 ... 26149.0 0.197904
8 2020-01-17 07:00:06 1 ... 2783.0 0.301114
9 2020-01-18 08:16:01 1 ... 15915.0 0.366949
[10 rows x 7 columns]
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
Group hour_code_0 dataframe:
Startdtm HostName_key ... kmeans_2 kmeans_3
0 2020-01-15 08:22:39+11:00 0 ... 1 1
1 2020-01-13 08:22:07+11:00 2 ... 0 0
2 2020-01-23 07:16:55+11:00 3 ... 0 2
[3 rows x 10 columns]
Group hour_code_1 dataframe:
Startdtm HostName_key ... kmeans_2 kmeans_3
3 2020-01-15 07:00:06+11:00 4 ... 1 1
4 2020-01-15 08:16:01+11:00 1 ... 0 0
5 2020-01-16 08:22:39+11:00 2 ... 0 2
[3 rows x 10 columns]
Group hour_code_2 dataframe:
Startdtm HostName_key ... kmeans_2 kmeans_3
6 2020-01-14 08:22:07+11:00 2 ... 1 2
7 2020-01-25 07:16:55+11:00 4 ... 0 0
8 2020-01-17 07:00:06+11:00 1 ... 1 1
9 2020-01-18 08:16:01+11:00 1 ... 1 1
[4 rows x 10 columns]
Update: second data
So this time things will be a little different, because the goal of the update is to learn how many vehicles are at each place, and their speed.
Again, for ease of adaptation, everything is written very carefully and as generically as possible.
1. First, the dataset is divided by location (configurable via dividing_colum):
def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d
                    for g, d in df.groupby(dividing_colum)}
    return df_by_groups
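A tiny self-contained demo of this dictionary-of-groups split, on made-up rows:

```python
import pandas as pd

# Made-up events mirroring the question's Hostname / Vehicle_speed columns
df = pd.DataFrame({'Hostname': ['place_uno', 'place_duo', 'place_uno'],
                   'Vehicle_speed': [65, 59, 59]})
groups = {f'Hostname_{g}': d for g, d in df.groupby('Hostname')}
print(sorted(groups))                      # one sub-frame per location
print(len(groups['Hostname_place_uno']))   # rows in the place_uno sub-frame
```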
2. Then the data is arranged for each location (dividing_colum) group:
def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group
2.1. We group into 15-minute intervals. After dividing each hostname area's data into time intervals, we aggregate the number of vehicles into a volume column and the mean speed into an average_speed column.
def group_by_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    intervaled_df = (df.groupby(pd.Grouper(key=DATE_COLUMN_NAME, freq=INTERVAL_WINDOW))
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_interval(df)
    return df
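Here is the same Grouper-based aggregation run end-to-end on a few invented rows, to show the shape of the result:

```python
import pandas as pd

# Invented events for one location
df = pd.DataFrame({
    'DateTimeStamp': pd.to_datetime(['2019-11-01 08:22',
                                     '2019-11-01 08:23',
                                     '2019-11-01 08:40']),
    'Vehicle_speed': [65, 59, 52],
    'Hostname': ['place_uno'] * 3,
})
# Bins align to 15-minute boundaries: 08:15 gets two events, 08:30 gets one
out = (df.groupby(pd.Grouper(key='DateTimeStamp', freq='15Min'))
         .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
         .rename(columns={'Vehicle_speed': 'average_speed', 'Hostname': 'volume'}))
print(out)
```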
The final result of stage 2 is that each hostname's data is divided into 15-minute time windows, and we know how many vehicles passed in each one and what their average speed was.
This way we achieve the goal:
Another greedy question: how would I work speed into this measurement? I.e., high volume but low speed, also catered for.
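One simple way to fold speed in, given the volume and average_speed columns produced above, is to flag windows whose volume is high while speed is low. This is only a sketch; the data is invented and the thresholds (medians of the frame itself) are assumptions:

```python
import pandas as pd

# Assumed shape of a per-location 15-minute summary (values invented)
intervaled = pd.DataFrame(
    {'average_speed': [61.0, 59.0, 52.0], 'volume': [3, 1, 8]},
    index=pd.date_range('2019-11-01 08:15', periods=3, freq='15min'),
)
# "High volume, low speed" relative to this frame's own medians
vol_hi = intervaled['volume'] >= intervaled['volume'].median()
spd_lo = intervaled['average_speed'] <= intervaled['average_speed'].median()
intervaled['congested'] = vol_hi & spd_lo
print(intervaled)
```

In practice fixed domain thresholds (e.g. a known free-flow speed) would likely beat per-frame medians; the point is only that both columns are available per window.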
Again, everything can be customized using [TIME_INTERVAL_COLUMN_NAME, DATE_COLUMN_NAME, INTERVAL_WINDOW].
Whole code:
import pandas as pd

TIME_INTERVAL_COLUMN_NAME = 'time_interval'
DATE_COLUMN_NAME = 'DateTimeStamp'
INTERVAL_WINDOW = '15Min'


def round_time(df):
    # Set the date column to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Round to the interval
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)


def group_by_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    intervaled_df = (df.groupby(pd.Grouper(key=DATE_COLUMN_NAME, freq=INTERVAL_WINDOW))
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df


def arrange_data(df):
    df = group_by_interval(df)
    return df


def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d
                    for g, d in df.groupby(dividing_colum)}
    return df_by_groups


def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group


# Load data
df = pd.read_csv('data2.csv')
# Print data
print(df)
# Divide by column
df_by_groups = divide_df_by_column(df)
# Arrange data for each group
df_by_intervaled_group = arrange_groups_df(df_by_groups)
# For each hostname-key dataframe
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    print(f'Group {group_df_name} dataframe:')
    print(intervaled_group_df)
Sample output:
We can now get valuable results by measuring the volume (number of vehicles) and average speed for each individual hostname area.
DateTimeStamp VS_ID VS_Summary_Id Hostname Vehicle_speed Lane Length
0 11/01/2019 8:22 1 1 place_uno 65 2 71
1 11/01/2019 8:23 2 1 place_uno 59 1 375
2 11/01/2019 8:25 3 1 place_uno 59 1 389
3 11/01/2019 8:26 4 1 place_duo 59 1 832
4 11/01/2019 8:40 5 1 place_duo 52 1 409
Group Hostname_place_duo dataframe:
average_speed volume
DateTimeStamp
2019-11-01 08:15:00 59 1
2019-11-01 08:30:00 52 1
Group Hostname_place_uno dataframe:
average_speed volume
DateTimeStamp
2019-11-01 08:15:00 61 3
Appendix
I also created a round_time function, which rounds to the time interval without needing to group:
def round_time(df):
    # Set the date column to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Round to the interval
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)
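A quick standalone check of what dt.round with a 15-minute window does to two sample timestamps:

```python
import pandas as pd

# Rounds to the NEAREST 15-minute mark (not floor): 08:22 -> 08:15, 08:40 -> 08:45
ts = pd.to_datetime(pd.Series(['2019-11-01 08:22', '2019-11-01 08:40']))
print(ts.dt.round('15Min').dt.strftime('%H:%M').tolist())
```

Note the nearest-mark behavior: if flooring into the window is wanted instead, `dt.floor('15Min')` would be the choice.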
Third update
So this time we want to reduce the number of rows in the result.
The group_by_interval function now changes to group on the concise interval, and is therefore called group_by_concised_interval. We call the combination of [day-in-week, hour-minute] the 'concise interval', and it too can be customized, using CONCISE_INTERVAL_FORMAT.
def group_by_concised_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Round the times
    round_time(df)
    # Add the concise interval column
    add_consice_interval_columns(df)
    intervaled_df = (df.groupby(TIME_INTERVAL_CONCISE_COLUMN_NAME)
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df
1.1. group_by_concised_interval first rounds times to the given 15-minute interval (configurable via INTERVAL_WINDOW) using round_time.
1.2. After the time interval is created for each date, we apply add_consice_interval_columns, which extracts the concise form from the rounded timestamps.
def add_consice_interval_columns(df):
    # Add a column with the time interval at day-in-week, hour-minute resolution
    df[TIME_INTERVAL_CONCISE_COLUMN_NAME] = df[TIME_INTERVAL_COLUMN_NAME].apply(
        lambda x: x.strftime(CONCISE_INTERVAL_FORMAT))
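The '%A %H:%M' format used for CONCISE_INTERVAL_FORMAT collapses all dates onto a weekday name plus time of day, for example:

```python
import pandas as pd

# 2019-11-01 was a Friday, so the concise form drops the calendar date
ts = pd.Timestamp('2019-11-01 08:15')
print(ts.strftime('%A %H:%M'))
```

This is what lets rows from different weeks fall into the same group and thus reduces the row count.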
The whole code is:
import pandas as pd

TIME_INTERVAL_COLUMN_NAME = 'time_interval'
TIME_INTERVAL_CONCISE_COLUMN_NAME = 'time_interval_concise'
DATE_COLUMN_NAME = 'DateTimeStamp'
INTERVAL_WINDOW = '15Min'
CONCISE_INTERVAL_FORMAT = '%A %H:%M'


def round_time(df):
    # Set the date column to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Round to the interval
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)


def add_consice_interval_columns(df):
    # Add a column with the time interval at day-in-week, hour-minute resolution
    df[TIME_INTERVAL_CONCISE_COLUMN_NAME] = df[TIME_INTERVAL_COLUMN_NAME].apply(
        lambda x: x.strftime(CONCISE_INTERVAL_FORMAT))


def group_by_concised_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Round the times
    round_time(df)
    # Add the concise interval column
    add_consice_interval_columns(df)
    intervaled_df = (df.groupby(TIME_INTERVAL_CONCISE_COLUMN_NAME)
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df


def arrange_data(df):
    df = group_by_concised_interval(df)
    return df


def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d
                    for g, d in df.groupby(dividing_colum)}
    return df_by_groups


def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group


# Load data
df = pd.read_csv('data2.csv')
# Print data
print(df)
# Divide by column
df_by_groups = divide_df_by_column(df)
# Arrange data for each group
df_by_intervaled_group = arrange_groups_df(df_by_groups)
# For each hostname-key dataframe
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    print(f'Group {group_df_name} dataframe:')
    print(intervaled_group_df)
Output:
Group Hostname_place_duo dataframe:
average_speed volume
time_interval_concise
Friday 08:30 59 1
Friday 08:45 52 1
Group Hostname_place_uno dataframe:
average_speed volume
time_interval_concise
Friday 08:15 65 1
Friday 08:30 59 2
So now we can easily see how the traffic behaves on each day of the week, across all the available time intervals.