Results are not reproducible between runs despite setting seeds

Question

Dre*_*ana 7 python random scikit-learn random-seed

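How is it possible that running the same Python program twice, with exactly the same seeds and static input data, produces different results? Calling the function below repeatedly within one Jupyter Notebook session yields the same result each time, but after I restart the kernel the result changes. The same happens when I run the code as a Python script from the command line. Are there other steps people take to make their code reproducible? All the resources I have found only talk about seeding. The randomness is introduced by ShapRFECV.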

This code runs on CPU only.

MWE (in this code I generate a dataset and eliminate features using ShapRFECV, in case that matters):

import os, random
import numpy as np
import pandas as pd
from probatus.feature_elimination import ShapRFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

global_seed = 1234
os.environ['PYTHONHASHSEED'] = str(global_seed)
np.random.seed(global_seed)
random.seed(global_seed)

feature_names = ['f1', 'f2', 'f3_static', 'f4', 'f5', 'f6', 'f7',
                 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17',
                 'f18', 'f19', 'f20']

# Code from tutorial on probatus documentation
X, y = make_classification(n_samples=100, class_sep=0.05, n_informative=6, n_features=20,
                           random_state=0, n_redundant=10, n_clusters_per_class=1)
X = pd.DataFrame(X, columns=feature_names)

def shap_feature_selection(X, y, seed: int) -> list[str]:
    random_forest = RandomForestClassifier(random_state=seed, n_estimators=70, max_features='log2',
                                           criterion='entropy', class_weight='balanced')
    # Set to run on one thread only
    shap_elimination = ShapRFECV(clf=random_forest, step=0.2, cv=5,
                                 scoring='f1_macro', n_jobs=1, random_state=seed)

    report = shap_elimination.fit_compute(X, y, check_additivity=True, seed=seed)
    # Return the feature set with the best mean validation score (f1_macro)
    return report.iloc[[report['val_metric_mean'].idxmax() - 1]]['features_set'].to_list()[0]


Results:

# Results from the first run
shap_feature_selection(X, y, 0)

>>> ['f17', 'f15', 'f18', 'f8', 'f12', 'f1', 'f13']

# Running again in same session
shap_feature_selection(X, y, 0)

>>> ['f17', 'f15', 'f18', 'f8', 'f12', 'f1', 'f13']

# Restarting the kernel and running the exact same command
shap_feature_selection(X, y, 0)
>>> ['f8', 'f1', 'f17', 'f6', 'f18', 'f20', 'f12', 'f15', 'f7', 'f13', 'f11']


Details:

Ubuntu 22.04
Python 3.9.12
NumPy 1.22.0
scikit-learn 1.1.1
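
When a result changes between runs, it can help to record the exact versions of every library involved, not only NumPy and scikit-learn. The following is a minimal sketch using only the standard library; the helper name `describe_environment` and the package list are illustrative assumptions, and packages that are not installed are reported as such rather than raising an error.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect the Python version, platform string, and installed package versions."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing
            info[name] = "not installed"
    return info


# Hypothetical package list: adjust to whatever the script actually imports
env_info = describe_environment(["numpy", "pandas", "scikit-learn", "probatus", "shap"])
for key, value in env_info.items():
    print(f"{key}: {value}")
```

Saving this output next to each result makes it easier to tell whether two differing runs used the same library versions.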

Answer 1

Dre*_*ana 1

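This has now been fixed in probatus (it was a bug, apparently related to the pandas implementation they were using; see here). For me, everything works as expected when using the latest version of the probatus code (rather than the packaged release).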
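
Separately from the probatus fix, the symptom in the question (stable within one session, different after a restart) is typical of something that varies per interpreter process. One general example is Python's string hash randomization: `PYTHONHASHSEED` is read when the interpreter starts, so assigning `os.environ['PYTHONHASHSEED']` inside an already running script, as the MWE does, does not change hashing in that process. Whether this was the mechanism behind the probatus bug is not stated above, so the sketch below only illustrates the general behaviour. It launches fresh interpreters with and without the variable set in their environment; the names `CHILD`, `CHILD_LATE`, and `run_child` are hypothetical.

```python
import os
import subprocess
import sys

# Child program: print a string hash and the iteration order of a set of strings
CHILD = "print(hash('f1'), list({'f1', 'f2', 'f3', 'f4', 'f5'}))"

# Child program that sets the variable too late, after the interpreter has started
CHILD_LATE = "import os; os.environ['PYTHONHASHSEED'] = '1234'; print(hash('f1'))"


def run_child(code, hash_seed=None):
    """Run `code` in a fresh interpreter, optionally with PYTHONHASHSEED set at startup."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from the default (randomized) behaviour
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = str(hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Seed present in the environment before startup: hashes and set order repeat
seeded_a = run_child(CHILD, hash_seed=1234)
seeded_b = run_child(CHILD, hash_seed=1234)
print("seeded runs identical:", seeded_a == seeded_b)

# Seed assigned inside the running process: the two hashes almost certainly differ
late_a = run_child(CHILD_LATE)
late_b = run_child(CHILD_LATE)
print("late-seeded runs identical:", late_a == late_b)
```

To fix the hash seed for a real run, set the variable in the shell or kernel environment before Python starts, for example `PYTHONHASHSEED=1234 python script.py`. Sorting any set of names before iterating over it removes the dependence on hash order altogether.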
