How do I interpret the output of .predict() from a fitted scikit-survival model in Python?

Max*_*wer 9 python machine-learning survival-analysis scikit-survival

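I'm confused about how to interpret the output of .predict from a fitted CoxnetSurvivalAnalysis model in scikit-survival. I've read the "Introduction to Survival Analysis in scikit-survival" notebook and the API reference, but can't find an explanation. Below is a minimal example of what confuses me: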

import pandas as pd
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.linear_model import CoxnetSurvivalAnalysis

# load data
data_X, data_y = load_veterans_lung_cancer()

# one-hot-encode categorical columns in X
categorical_cols = ['Celltype', 'Prior_therapy', 'Treatment']

X = data_X.copy()
for c in categorical_cols:
    dummy_matrix = pd.get_dummies(X[c], prefix=c, drop_first=False)
    X = pd.concat([X, dummy_matrix], axis=1).drop(c, axis=1)

# display final X to fit Cox Elastic Net model on
del data_X
print(X.head(3))

So this is the X that goes into the model:

   Age_in_years  Celltype  Karnofsky_score  Months_from_Diagnosis  \
0          69.0  squamous             60.0                    7.0   
1          64.0  squamous             70.0                    5.0   
2          38.0  squamous             60.0                    3.0   

  Prior_therapy Treatment  
0            no  standard  
1           yes  standard  
2            no  standard  

...moving on to fit the model and generate predictions:

# Fit Model
coxnet = CoxnetSurvivalAnalysis()
coxnet.fit(X, data_y)    

# What are these predictions?    
preds = coxnet.predict(X)

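preds has the same number of records as X, but its values are very different from the values in data_y, even when predicting on the same data the model was fit on.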

print(preds.mean()) 
print(data_y['Survival_in_days'].mean())

Output:

-0.044114643249153422
121.62773722627738

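So what exactly is preds? Clearly .predict means something quite different here than it does in scikit-learn, but I can't figure out what. The API reference says it returns the "predicted decision function," but what does that mean? And how do I get to the predicted time in months, yhat, for a given X? I'm new to survival analysis, so I'm obviously missing something.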

小智 0

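When you pass in X, predict returns the model's linear predictor evaluated on the input array, i.e. the dot product of X with the fitted coefficients: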

def predict(self, X, alpha=None):
    """The linear predictor of the model.
    Parameters
    ----------
    X : array-like, shape = (n_samples, n_features)
        Test data of which to calculate log-likelihood from
    alpha : float, optional
        Constant that multiplies the penalty terms. If the same alpha was used during training, exact
        coefficients are used, otherwise coefficients are interpolated from the closest alpha values that
        were used during training. If set to ``None``, the last alpha in the solution path is used.
    Returns
    -------
    T : array, shape = (n_samples,)
        The predicted decision function
    """
    X = check_array(X)
    coef = self._get_coef(alpha)
    return numpy.dot(X, coef)
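The last line of that method is just a dot product, so the returned value is a unitless risk score on the log-hazard scale, not a survival time. Here is a minimal sketch in plain Python. The coefficients and the two patient rows are made-up values for illustration, not output from the fitted model:

```python
import math

# Hypothetical coefficients for two features (age, Karnofsky score).
# Illustrative values only, not taken from the fitted coxnet model.
coef = [0.01, -0.03]

# Two hypothetical patients: [age in years, Karnofsky score].
X = [
    [69.0, 60.0],
    [38.0, 90.0],
]


def linear_predictor(row, coef):
    """Dot product of one feature row with the coefficient vector."""
    return sum(x * b for x, b in zip(row, coef))


# This mirrors what numpy.dot(X, coef) computes row by row.
scores = [linear_predictor(row, coef) for row in X]

# In a Cox model, exp(score_a - score_b) is the hazard ratio of a versus b.
hazard_ratio = math.exp(scores[0] - scores[1])

print(scores)
print(hazard_ratio)
```

A higher score means a higher hazard, so a shorter expected survival. Only the ordering and the differences between scores are meaningful, which is why the mean of preds has nothing to do with the mean of Survival_in_days.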

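check_array is defined in another module (it is scikit-learn's input-validation helper, sklearn.utils.check_array); you can see the rest in the coxnet source code.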
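To get from a risk score to something in time units, you need a baseline hazard as well. As far as I know, recent scikit-survival versions can do this for you if you construct the model with fit_baseline_model=True and then call predict_survival_function, but check the API reference for your version. The idea behind it can be sketched in plain Python with the Breslow estimator. The toy times, event flags, and scores below are hypothetical, and this is an illustration of the technique, not the library's implementation:

```python
import math


def breslow_baseline(times, events, scores):
    """Breslow estimate of the cumulative baseline hazard H0(t).

    Returns a list of (event_time, H0) pairs in increasing time order.
    """
    risk = [math.exp(s) for s in scores]
    event_times = sorted({t for t, e in zip(times, events) if e})
    cum_hazard, out = 0.0, []
    for t in event_times:
        # Number of events observed exactly at time t.
        deaths = sum(1 for ti, e in zip(times, events) if e and ti == t)
        # Sum of exp(score) over everyone still at risk at time t.
        at_risk = sum(r for ti, r in zip(times, risk) if ti >= t)
        cum_hazard += deaths / at_risk
        out.append((t, cum_hazard))
    return out


def survival_curve(baseline, score):
    """S(t | x) = exp(-H0(t) * exp(score)) at each event time."""
    return [(t, math.exp(-h0 * math.exp(score))) for t, h0 in baseline]


def median_survival(curve):
    """First time at which the survival probability is 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # curve never reaches 0.5 within the observed follow-up


# Toy data (hypothetical): times in days, event flags, risk scores.
times = [5, 8, 12, 20, 30]
events = [True, True, False, True, True]
scores = [1.0, 0.5, 0.0, -0.5, -1.0]

baseline = breslow_baseline(times, events, scores)
high_risk = survival_curve(baseline, score=1.0)
low_risk = survival_curve(baseline, score=-1.0)

print(median_survival(high_risk), median_survival(low_risk))
```

The median of the predicted survival curve is one common choice for a single "predicted time" yhat. It is undefined when the curve never drops to 0.5, which is why the helper returns None in that case.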