ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input

dat*_*den 7 python decision-tree topological-sort cross-validation

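I am trying to reproduce this GitHub project, which uses Topological Data Analysis (TDA), on my machine.

My steps

  • Get the best parameters from the cross-validation output
  • Load my dataset (after feature selection)
  • Extract the topological features from the dataset for prediction
  • Create a random forest classifier model based on the best parameters
  • Compute the probabilities on the test data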

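Background

  1. Feature selection

To decide which attributes belong to which group, we created a correlation matrix. From it, we saw that there are two large groups within which the player attributes are strongly correlated with each other. We therefore decided to split the attributes into two groups, one summarising a player's attacking characteristics and the other the defensive ones. Finally, since goalkeepers have completely different statistics from the other players, we decided to consider only their overall rating. Below are the 24 features used for each player:

Attack: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"
Defense: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"
Goalkeeper: "overall_rating"

From this set of features, the next step is to compute, for each non-goalkeeper player, the mean of the attack attributes and the mean of the defense attributes.

Finally, for each team in a given match, we compute the mean and standard deviation of attack and of defense from these player statistics, as well as the best attack and the best defense.

In this way, a match is described by 14 features (GK overall rating, best attack, std attack, mean attack, best defense, std defense and mean defense, for each of the two teams), which map the match into a space according to the characteristics of the two teams.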
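The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the helper name `team_features`, the toy ratings, the column order and the use of NumPy's default (population) standard deviation are all assumptions.

```python
import numpy as np

def team_features(attack, defense, gk_rating):
    """Summarise one team as 7 numbers (hypothetical helper).

    attack, defense: per-player averages for the outfield players.
    gk_rating: overall rating of the goalkeeper.
    """
    attack = np.asarray(attack, dtype=float)
    defense = np.asarray(defense, dtype=float)
    return np.array([
        attack.max(), defense.max(),    # best attack / best defense
        attack.mean(), defense.mean(),  # average attack / average defense
        attack.std(), defense.std(),    # spread (population std, an assumption)
        float(gk_rating),               # goalkeeper overall rating
    ])

# Toy numbers, invented for illustration only.
home = team_features([70, 80, 75], [60, 65, 70], gk_rating=78)
away = team_features([68, 72, 90], [55, 75, 65], gk_rating=81)

# One match = home features followed by away features, 14 values in total.
match_vector = np.concatenate([home, away])
print(match_vector.shape)
```

Seven values per team, for two teams, gives the 14 match features mentioned in the text.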


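  2. Feature extraction

The purpose of TDA is to capture the structure of the space underlying the data. In our project, we assume that the neighbourhood of a data point hides meaningful information that is correlated with the outcome of the match. We therefore explored the data space looking for this kind of correlation.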
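To make the idea of a "neighbourhood" concrete, here is a minimal NumPy sketch that picks the k nearest matches around one match in feature space. It is only an illustration: the project's own `SubSpaceExtraction` step is parameterised differently (`k_min`, `k_max`, `dist_percentage`), and the function name `local_point_cloud` and the random data are assumptions.

```python
import numpy as np

def local_point_cloud(points, index, k):
    """Return the k points closest to points[index], itself included."""
    dists = np.linalg.norm(points - points[index], axis=1)
    nearest = np.argsort(dists)[:k]  # indices sorted by increasing distance
    return points[nearest]

rng = np.random.default_rng(0)
matches = rng.normal(size=(100, 14))  # 100 synthetic matches, 14 features each

cloud = local_point_cloud(matches, index=0, k=20)
print(cloud.shape)
```

A persistence diagram would then be computed on each such cloud, and summary numbers derived from the diagram become the extra topological features of that match.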


Methods

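# Imports needed by the functions below. The Amplitude import path assumes
# giotto-tda (package name gtda); older releases used a different package name.
# read_pickle, get_dataset and get_pipeline are assumed to be imported elsewhere.
import numpy as np
import pandas as pd
from tqdm import tqdm
from gtda.diagrams import Amplitude

def get_best_params():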
    cv_output = read_pickle('cv_output.pickle')
    best_model_params, top_feat_params, top_model_feat_params, *_ = cv_output

    return top_feat_params, top_model_feat_params

def load_dataset():
    x_y = get_dataset(42188).get_data(dataset_format='array')[0]
    x_train_with_topo = x_y[:, :-1]
    y_train = x_y[:, -1]

    return x_train_with_topo, y_train


def extract_x_test_features(x_train, y_train, players_df, pipeline):
    """Extract the topological features from the test set. This requires also the train set

    Parameters
    ----------
    x_train:
        The x used in the training phase
    y_train:
        The 'y' used in the training phase
    players_df: pd.DataFrame
        The DataFrame containing the matches with all the players, from which to extract the test set
    pipeline: Pipeline
        The Giotto pipeline

    Returns
    -------
    x_test:
        The x_test with the topological features
    """
    x_train_no_topo = x_train[:, :14]
    y_test = np.zeros(len(players_df))  # Artificial y_test for features computation
    print('Y_TEST',y_test.shape)

    x_test_topo = extract_features_for_prediction(x_train_no_topo, y_train, players_df.values, y_test, pipeline)

    return x_test_topo

def extract_topological_features(diagrams):
    metrics = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
    new_features = []
    for metric in metrics:
        amplitude = Amplitude(metric=metric)
        new_features.append(amplitude.fit_transform(diagrams))
    new_features = np.concatenate(new_features, axis=1)
    return new_features

def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    shift = 10
    top_features = []
    all_x_train = x_train
    all_y_train = y_train
    for i in tqdm(range(0, len(x_test), shift)):
        #
        print(range(0, len(x_test), shift) )
        if i+shift > len(x_test):
            shift = len(x_test) - i
        batch = np.concatenate([all_x_train, x_test[i: i + shift]])
        batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])
        diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)
        new_features_batch = extract_topological_features(diagrams_batch[-shift:])
        top_features.append(new_features_batch)
        all_x_train = np.concatenate([all_x_train, batch[-shift:]])
        all_y_train = np.concatenate([all_y_train, batch_y[-shift:]])
    final_x_test = np.concatenate([x_test, np.concatenate(top_features, axis=0)], axis=1)
    return final_x_test

def get_probabilities(model, x_test, team_ids):
    """Get the probabilities on the outcome of the matches contained in the test set

    Parameters
    ----------
    model:
        The model (must have the 'predict_proba' function)
    x_test:
        The test set
    team_ids: pd.DataFrame
        The DataFrame containing, for each match in the test set, the ids of the two teams
    Returns
    -------
    probabilities:
        The probabilities for each match in the test set
    """
    prob_pred = model.predict_proba(x_test)
    prob_match_df = pd.DataFrame(data=prob_pred, columns=['away_team_prob', 'draw_prob', 'home_team_prob'])
    prob_match_df = pd.concat([team_ids.reset_index(drop=True), prob_match_df], axis=1)
    return prob_match_df
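`get_probabilities` names the probability columns by hand, so it relies on `predict_proba` returning them in the order away, draw, home. scikit-learn orders those columns according to `model.classes_`. The sketch below, on synthetic data, shows how to check that order. Which numeric label stands for which outcome in the project's dataset is not stated in the question, so that mapping is left open here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=(90, 4))
y = np.repeat([0.0, 1.0, 2.0], 30)  # synthetic labels; their meaning is assumed

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(x, y)
proba = model.predict_proba(x[:5])

print(model.classes_)     # the column order of `proba`
print(proba.shape)        # one row per sample, one column per class
print(proba.sum(axis=1))  # each row sums to 1
```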

Working code

best_pipeline_params, best_model_feat_params = get_best_params()

# 'best_pipeline_params' -> {'k_min': 50, 'k_max': 175, 'dist_percentage': 0.1}
# best_model_feat_params -> {'n_estimators': 1000, 'max_depth': 10, 'random_state': 52, 'max_features': 0.5}

pipeline = get_pipeline(best_pipeline_params)
# pipeline -> Pipeline(steps=[('extract_point_clouds',
            # SubSpaceExtraction(dist_percentage=0.1, k_max=175, k_min=50)),
            #('create_diagrams', VietorisRipsPersistence(n_jobs=-1))])

x_train, y_train = load_dataset()

# x_train.shape ->  (2565, 19)
# y_train.shape -> (2565,)

x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)

# x_test.shape -> (380, 24)

rf_model = RandomForestClassifier(**best_model_feat_params)
rf_model.fit(x_train, y_train)
matches_probabilities = get_probabilities(rf_model, x_test, team_ids)  # <-- breaks here
matches_probabilities.head()
compute_final_standings(matches_probabilities, 'premier league')

But I get the error:

ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.
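The error itself is easy to reproduce outside the project. The following sketch uses random arrays with the same widths as in the question (19 columns for fitting, 24 for prediction); the data and the small forest size are placeholders, and the exact wording of the message may vary with the scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
x_train = rng.normal(size=(60, 19))    # same width as the training set
y_train = rng.integers(0, 3, size=60)  # three synthetic classes
x_test = rng.normal(size=(5, 24))      # five columns too many

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(x_train, y_train)
print(model.n_features_in_)  # width the fitted model expects

try:
    model.predict_proba(x_test)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

Comparing `x_test.shape[1]` with `model.n_features_in_` before calling `predict_proba` is a cheap way to catch the mismatch early.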

Loaded dataset (X_train)

Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   home_best_attack    2565 non-null   float64
 1   home_best_defense   2565 non-null   float64
 2   home_avg_attack     2565 non-null   float64
 3   home_avg_defense    2565 non-null   float64
 4   home_std_attack     2565 non-null   float64
 5   home_std_defense    2565 non-null   float64
 6   gk_home_player_1    2565 non-null   float64
 7   away_avg_attack     2565 non-null   float64
 8   away_avg_defense    2565 non-null   float64
 9   away_std_attack     2565 non-null   float64
 10  away_std_defense    2565 non-null   float64
 11  away_best_attack    2565 non-null   float64
 12  away_best_defense   2565 non-null   float64
 13  gk_away_player_1    2565 non-null   float64
 14  bottleneck_metric   2565 non-null   float64
 15  wasserstein_metric  2565 non-null   float64
 16  landscape_metric    2565 non-null   float64
 17  betti_metric        2565 non-null   float64
 18  heat_metric         2565 non-null   float64
 19  label               2565 non-null   float64

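Note that the first 14 columns are the features describing the match, and the remaining 5 features (not counting the label) are the topological features, which have already been extracted.

The problem seems to occur when the code reaches extract_x_test_features() and extract_features_for_prediction(), which should compute the topological features and stack them onto the training dataset.

Since X_train already has the topological features, it adds another 5, so I end up with 24 features.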
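The arithmetic can be checked with plain NumPy. In `extract_features_for_prediction`, the new topological columns are concatenated onto `x_test` along `axis=1`, so the final width is the width of the incoming test array plus the number of newly computed columns. The sketch below uses zero-filled stand-ins; the helper name `appended_width` is hypothetical, and since the question does not show the shape of `new_players_df_stats`, both decompositions of 24 are assumptions rather than findings.

```python
import numpy as np

def appended_width(base_cols, topo_cols, n_rows=380):
    """Width after stacking topological columns next to the base columns."""
    base = np.zeros((n_rows, base_cols))
    topo = np.zeros((n_rows, topo_cols))
    return np.concatenate([base, topo], axis=1).shape[1]

# Two hypothetical ways to arrive at 24 columns:
print(appended_width(19, 5))   # test rows already carry 5 topological columns
print(appended_width(14, 10))  # 14 match features plus 10 newly computed columns

# Only this combination matches the 19-column training set.
print(appended_width(14, 5))
```

Printing `x_test.shape` and `np.concatenate(top_features, axis=0).shape` just before the final `np.concatenate` would show which case applies. In giotto-tda, `Amplitude` returns one column per homology dimension when its `order` parameter is `None`, so the second case may be possible depending on the installed version.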

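I'm not sure, though. I'm just trying to wrap my head around this project... and around how the predictions are made here.


How can I fix the mismatch using the code above?


Notes

1 - x_train and y_test are not DataFrames but numpy.ndarray

2 - This problem is fully reproducible if you clone or download the project from the link below:

GitHub link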

dat*_*den 1

Returning a slice with 19 features here:

def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
   (...)
   return final_x_test[:, :19]

removes the error and lets the test run.
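It is worth checking what this slice actually keeps. Because the new topological columns are appended on the right, `[:, :19]` keeps the leftmost 19 columns and discards whatever comes after them. The sketch below fills each column with its own index to make that visible; the array is synthetic.

```python
import numpy as np

# Each entry holds its own column index, so slices are easy to read.
final_x_test = np.tile(np.arange(24), (3, 1))

kept = final_x_test[:, :19]
dropped = final_x_test[:, 19:]

print(kept.shape)
print(dropped[0])  # indices of the columns that the slice throws away
```

So the error disappears because the width now matches, but whether the 19 surviving columns mean the same thing as the 19 training columns depends on what the incoming test array contained. The slice alone does not guarantee it.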


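But I still don't get the gist of it.

I will award a bounty to anyone who explains to me the idea behind the test set in the context of this project, as used in the project notebook, which can be found here:

Project notebook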