Results are not reproducible between runs despite setting seeds

Dre*_*ana 7 python random scikit-learn random-seed

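How is it possible that running the same Python program twice, with exactly the same seeds and static input data, produces different results? Calling the function below in a Jupyter Notebook gives the same result every time, but once I restart the kernel the result changes. The same happens when I run the code as a Python script from the command line. Are there other steps people take to make sure their code is reproducible? All the resources I found talk only about seeding. The randomness is introduced by ShapRFECV.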

This code runs on the CPU only.

MWE (in this code I generate a dataset and eliminate features with ShapRFECV, in case that matters):

import os, random
import numpy as np
import pandas as pd
from probatus.feature_elimination import ShapRFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

global_seed = 1234
os.environ['PYTHONHASHSEED'] = str(global_seed)
np.random.seed(global_seed)
random.seed(global_seed)

feature_names = ['f1', 'f2', 'f3_static', 'f4', 'f5', 'f6', 'f7',
                 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17',
                 'f18', 'f19', 'f20']

# Code from tutorial on probatus documentation
X, y = make_classification(n_samples=100, class_sep=0.05, n_informative=6, n_features=20,
                           random_state=0, n_redundant=10, n_clusters_per_class=1)
X = pd.DataFrame(X, columns=feature_names)

def shap_feature_selection(X, y, seed: int) -> list[str]:
    random_forest = RandomForestClassifier(
        random_state=seed, n_estimators=70, max_features='log2',
        criterion='entropy', class_weight='balanced')
    # Set to run on one thread only
    shap_elimination = ShapRFECV(
        clf=random_forest, step=0.2, cv=5,
        scoring='f1_macro', n_jobs=1, random_state=seed)

    report = shap_elimination.fit_compute(X, y, check_additivity=True, seed=seed)
    # Return the feature set with the best mean validation score (f1_macro)
    return report.iloc[[report['val_metric_mean'].idxmax() - 1]]['features_set'].to_list()[0]

Results:

# Results from the first run
shap_feature_selection(X, y, 0)

>>> ['f17', 'f15', 'f18', 'f8', 'f12', 'f1', 'f13']

# Running again in same session
shap_feature_selection(X, y, 0)

>>> ['f17', 'f15', 'f18', 'f8', 'f12', 'f1', 'f13']

# Restarting the kernel and running the exact same command
shap_feature_selection(X, y, 0)
>>> ['f8', 'f1', 'f17', 'f6', 'f18', 'f20', 'f12', 'f15', 'f7', 'f13', 'f11']

Details:

  • Ubuntu 22.04
  • Python 3.9.12
  • NumPy 1.22.0
  • scikit-learn 1.1.1
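
One detail of the MWE is worth checking, whether or not it is related to the behaviour above: CPython reads PYTHONHASHSEED once at interpreter start-up, so assigning it through os.environ inside a running script does not change string hashing (and therefore set ordering) in that process. It only reaches child processes started afterwards. The sketch below uses the standard library only; the helper name hash_in_fresh_interpreter is made up for illustration.

```python
import os
import subprocess
import sys

# Assigning the variable at runtime leaves the current interpreter's hashing unchanged.
hash_before = hash("f1")
os.environ["PYTHONHASHSEED"] = "1234"
hash_after = hash("f1")
print(hash_before == hash_after)


def hash_in_fresh_interpreter(hash_seed):
    """Hypothetical helper: hash('f1') as computed by a newly started interpreter."""
    env = dict(os.environ)
    env["PYTHONHASHSEED"] = str(hash_seed)  # set before the child interpreter starts
    completed = subprocess.run(
        [sys.executable, "-c", "print(hash('f1'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(completed.stdout)


# Two separate interpreters started with the same hash seed agree with each other.
first = hash_in_fresh_interpreter(1234)
second = hash_in_fresh_interpreter(1234)
print(first == second)
```

To fix the hash seed for a script itself, the variable has to be set in the environment that launches Python, for example in the shell or in the Jupyter kernel configuration, rather than inside the script.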

Dre*_*ana 1

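This has now been fixed in probatus (it was a bug, apparently related to the pandas implementation they were using; see here). For me, everything works as expected when using the latest version of the probatus code (rather than the released package).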
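
Because the symptom only appeared after a kernel restart, re-running the function in the same session cannot confirm the fix; each run needs a fresh interpreter. A minimal sketch of such a check follows. The snippet being run is a stand-in (seeded random plus a sorted set of feature names), not the probatus code, and run_in_fresh_process is a hypothetical helper name.

```python
import subprocess
import sys

# Stand-in workload: replace with the code whose reproducibility is being checked.
SNIPPET = """
import random
random.seed(0)
features = {'f1', 'f8', 'f12', 'f17'}
# sorted() removes any dependence on set iteration order
print(sorted(features), random.random())
"""


def run_in_fresh_process(code):
    """Hypothetical helper: run `code` in a new interpreter and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout


# Collect the distinct outputs of three independent runs.
outputs = {run_in_fresh_process(SNIPPET) for _ in range(3)}
print("reproducible" if len(outputs) == 1 else "differs between processes")
```

If more than one distinct output appears, something outside the explicit seeds (hash randomization, thread scheduling, or a library bug such as the one fixed here) is still influencing the result.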