I am trying to reproduce this GitHub project, which uses Topological Data Analysis (TDA), on my machine.
My steps:
Background:
To decide which attributes belong to which group, we created a correlation matrix. From it, we saw that there are two large groups in which the player attributes are strongly correlated with one another. We therefore decided to split the attributes into two groups, one summarizing a player's attacking characteristics and the other the defensive ones. Finally, since goalkeepers have statistics completely different from the other players, we decided to consider only their overall rating. Below you can see the 24 features used for each player:
Attack: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"
Defense: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"
Goalkeeper: "overall_rating"
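Just to illustrate what that correlation-matrix step looks like, here is a toy sketch with synthetic columns (not the real attribute data): columns driven by a shared latent "attack" or "defense" score show up as two correlated blocks in the matrix, which is the pattern that motivates the split.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_players = 500
attack_latent = rng.normal(size=n_players)
defense_latent = rng.normal(size=n_players)

# Each toy column is its group's latent score plus noise, so the correlation
# matrix shows two strongly correlated blocks, one per group.
toy = pd.DataFrame({
    'finishing': attack_latent + rng.normal(scale=0.5, size=n_players),
    'dribbling': attack_latent + rng.normal(scale=0.5, size=n_players),
    'shot_power': attack_latent + rng.normal(scale=0.5, size=n_players),
    'marking': defense_latent + rng.normal(scale=0.5, size=n_players),
    'standing_tackle': defense_latent + rng.normal(scale=0.5, size=n_players),
    'interceptions': defense_latent + rng.normal(scale=0.5, size=n_players),
})
print(toy.corr().round(2))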
Given this set of features, the next step is, for every player who is not a goalkeeper, to compute the mean of the attacking attributes and the mean of the defensive attributes.
Finally, for each team in a given match, we compute the mean and the standard deviation of attack and defense over the team's players, together with the best attack and the best defense.
In this way a match is described by 14 features, 7 per team (GK overall rating, best attack, std attack, avg attack, best defense, std defense, avg defense), which map it into a space that follows the characteristics of the two teams.
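To check my understanding of that construction, here is a minimal sketch with a made-up line-up (the team_features helper, the column lists and the random values are my own reconstruction, not code from the project):

import numpy as np
import pandas as pd

ATTACK = ['positioning', 'crossing', 'finishing', 'heading_accuracy', 'short_passing',
          'reactions', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
          'acceleration', 'sprint_speed', 'agility', 'penalties', 'vision',
          'shot_power', 'long_shots']
DEFENSE = ['interceptions', 'aggression', 'marking', 'standing_tackle',
           'sliding_tackle', 'long_passing']


def team_features(outfield, gk_overall):
    """Aggregate one team's line-up into its 7 per-team features."""
    attack = outfield[ATTACK].mean(axis=1)    # per-player attack score
    defense = outfield[DEFENSE].mean(axis=1)  # per-player defense score
    return {
        'gk': gk_overall,
        'best_attack': attack.max(), 'avg_attack': attack.mean(), 'std_attack': attack.std(),
        'best_defense': defense.max(), 'avg_defense': defense.mean(), 'std_defense': defense.std(),
    }


# Ten outfield players with random attribute values, standing in for one team
rng = np.random.default_rng(0)
outfield = pd.DataFrame(rng.uniform(40, 95, size=(10, len(ATTACK + DEFENSE))),
                        columns=ATTACK + DEFENSE)
home = team_features(outfield, gk_overall=82.0)
print(home)  # a match would then be home + away: 2 * 7 = 14 features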
The purpose of TDA is to capture the structure of the space underlying the data. In our project we assume that the neighbourhood of a data point hides meaningful information that is correlated with the outcome of the match, so we explored the data space looking for this kind of correlation.
Method:
def get_best_params():
    # Best hyper-parameters found during cross-validation, loaded from a pickle
    cv_output = read_pickle('cv_output.pickle')
    best_model_params, top_feat_params, top_model_feat_params, *_ = cv_output
    return top_feat_params, top_model_feat_params


def load_dataset():
    # Load the full dataset; the last column is the label
    x_y = get_dataset(42188).get_data(dataset_format='array')[0]
    x_train_with_topo = x_y[:, :-1]
    y_train = x_y[:, -1]
    return x_train_with_topo, y_train


def extract_x_test_features(x_train, y_train, players_df, pipeline):
    """Extract the topological features from the test set. This requires also the train set.

    Parameters
    ----------
    x_train:
        The x used in the training phase
    y_train:
        The 'y' used in the training phase
    players_df: pd.DataFrame
        The DataFrame containing the matches with all the players, from which to extract the test set
    pipeline: Pipeline
        The Giotto pipeline

    Returns
    -------
    x_test:
        The x_test with the topological features
    """
    x_train_no_topo = x_train[:, :14]
    y_test = np.zeros(len(players_df))  # Artificial y_test for features computation
    print('Y_TEST', y_test.shape)
    x_test_topo = extract_features_for_prediction(x_train_no_topo, y_train, players_df.values, y_test, pipeline)
    return x_test_topo


def extract_topological_features(diagrams):
    # One block of amplitude features per metric, computed from the persistence diagrams
    metrics = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
    new_features = []
    for metric in metrics:
        amplitude = Amplitude(metric=metric)
        new_features.append(amplitude.fit_transform(diagrams))
    new_features = np.concatenate(new_features, axis=1)
    return new_features


def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    # Process the test rows in batches of `shift` matches: each batch is stacked on top
    # of the (growing) training set, the persistence diagrams are recomputed on the
    # combined data, and only the features of the newly added rows are kept.
    shift = 10
    top_features = []
    all_x_train = x_train
    all_y_train = y_train
    for i in tqdm(range(0, len(x_test), shift)):
        print(range(0, len(x_test), shift))
        if i + shift > len(x_test):
            shift = len(x_test) - i
        batch = np.concatenate([all_x_train, x_test[i: i + shift]])
        batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])
        diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)
        new_features_batch = extract_topological_features(diagrams_batch[-shift:])
        top_features.append(new_features_batch)
        all_x_train = np.concatenate([all_x_train, batch[-shift:]])
        all_y_train = np.concatenate([all_y_train, batch_y[-shift:]])
    final_x_test = np.concatenate([x_test, np.concatenate(top_features, axis=0)], axis=1)
    return final_x_test


def get_probabilities(model, x_test, team_ids):
    """Get the probabilities on the outcome of the matches contained in the test set.

    Parameters
    ----------
    model:
        The model (must have the 'predict_proba' function)
    x_test:
        The test set
    team_ids: pd.DataFrame
        The DataFrame containing, for each match in the test set, the ids of the two teams

    Returns
    -------
    probabilities:
        The probabilities for each match in the test set
    """
    prob_pred = model.predict_proba(x_test)
    prob_match_df = pd.DataFrame(data=prob_pred, columns=['away_team_prob', 'draw_prob', 'home_team_prob'])
    prob_match_df = pd.concat([team_ids.reset_index(drop=True), prob_match_df], axis=1)
    return prob_match_df
Working code:
best_pipeline_params, best_model_feat_params = get_best_params()
# 'best_pipeline_params' -> {'k_min': 50, 'k_max': 175, 'dist_percentage': 0.1}
# best_model_feat_params -> {'n_estimators': 1000, 'max_depth': 10, 'random_state': 52, 'max_features': 0.5}
pipeline = get_pipeline(best_pipeline_params)
# pipeline -> Pipeline(steps=[('extract_point_clouds',
# SubSpaceExtraction(dist_percentage=0.1, k_max=175, k_min=50)),
#('create_diagrams', VietorisRipsPersistence(n_jobs=-1))])
x_train, y_train = load_dataset()
# x_train.shape -> (2565, 19)
# y_train.shape -> (2565,)
x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)
# x_test.shape -> (380, 24)
rf_model = RandomForestClassifier(**best_model_feat_params)
rf_model.fit(x_train, y_train)
matches_probabilities = get_probabilities(rf_model, x_test, team_ids) # <-- breaks here
matches_probabilities.head()
compute_final_standings(matches_probabilities, 'premier league')
But I get this error:
ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.
The loaded dataset (x_train):
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   home_best_attack    2565 non-null   float64
 1   home_best_defense   2565 non-null   float64
 2   home_avg_attack     2565 non-null   float64
 3   home_avg_defense    2565 non-null   float64
 4   home_std_attack     2565 non-null   float64
 5   home_std_defense    2565 non-null   float64
 6   gk_home_player_1    2565 non-null   float64
 7   away_avg_attack     2565 non-null   float64
 8   away_avg_defense    2565 non-null   float64
 9   away_std_attack     2565 non-null   float64
 10  away_std_defense    2565 non-null   float64
 11  away_best_attack    2565 non-null   float64
 12  away_best_defense   2565 non-null   float64
 13  gk_away_player_1    2565 non-null   float64
 14  bottleneck_metric   2565 non-null   float64
 15  wasserstein_metric  2565 non-null   float64
 16  landscape_metric    2565 non-null   float64
 17  betti_metric        2565 non-null   float64
 18  heat_metric         2565 non-null   float64
 19  label               2565 non-null   float64
Note that the first 14 columns are the features describing the match, while the remaining 5 columns (the label aside) are the topological features that have already been extracted.
The problem seems to arise when the code reaches extract_x_test_features() and extract_features_for_prediction(), which are supposed to compute the topological features and stack the training dataset with them. Since x_train already contains the topological features, another 5 get added, so I end up with 24 features.
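If that reading is right, the shape arithmetic would be something like this (dummy arrays purely for illustration, not project code):

import numpy as np

n_matches = 380
match_feats = np.zeros((n_matches, 14))  # the 14 match-describing features
old_topo = np.zeros((n_matches, 5))      # 5 topological features already in the data
new_topo = np.zeros((n_matches, 5))      # 5 more appended by extract_features_for_prediction

x_19 = np.hstack([match_feats, old_topo])  # what the model was trained on
x_24 = np.hstack([x_19, new_topo])         # what get_probabilities receives
print(x_19.shape, x_24.shape)              # (380, 19) (380, 24) -> mismatch with the 19-feature model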
I'm not sure, though. I'm still trying to wrap my head around this project... and around how the prediction is made here.
How can I fix this mismatch, using the code above?
Notes:
1 - x_train and y_test are not DataFrames but numpy.ndarrays.
2 - The problem is fully reproducible if the project is cloned or downloaded from the following link:
Returning a slice with 19 features here:
def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    (...)
    return final_x_test[:, :19]
gets rid of the error and the test runs.
But I still don't get the point of it.
I will award a bounty to anyone who explains to me the idea behind the test set in the context of this project, as implemented in the project notebook, which can be found here: