Using TimeSeriesSplit within cross_val_score

use*_*548 5 python time-series scikit-learn

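I am fitting a time series, and I am trying to cross-validate it with TimeSeriesSplit. I believe the simplest way to apply it is through the cross_val_score function, via the cv parameter.

The question is simple: am I passing the cv argument correctly? Should I call split(scaled_train), or should I use split(X_train) or split(input_data)? Or should I cross-validate in some other way?

This is the code I am writing: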

  def fit_model1(data: pd.DataFrame):
      df = data
      scores_fit_model1 = []
      for sizes in test_sizes:
          # Generate test design
          input_data = df.drop('next_count', axis=1)
          output_data = df[['next_count']]
          X_train, X_test, y_train, y_test = train_test_split(input_data, output_data, test_size=sizes, random_state=0, shuffle=False)

          # Scaling
          scaler = MinMaxScaler()
          scaled_train = scaler.fit_transform(X_train)
          scaled_test = scaler.transform(X_test)

          # Build model
          lr = LinearRegression()
          lr.fit(scaled_train, y_train.values.ravel())
          predictions = lr.predict(scaled_test)

          # Cross-validation definition
          time_split = TimeSeriesSplit(n_splits=10)

          # Performance metrics
          r2 = cross_val_score(lr, scaled_train, y_train.values.ravel(), cv=time_split.split(scaled_train), scoring='r2', n_jobs=1).mean()
          scores_fit_model1.append(r2)

      return scores_fit_model1

Jos*_*der 0

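TimeSeriesSplit is simply a splitter that yields a growing window of consecutive folds: each training set is an expanding prefix of the data, and each test set is the block that immediately follows it. You can therefore pass it to cv as-is, or pass time_split.split(scaled_train), which amounts to the same thing: splits over an array with the same number of rows as your training data (the second positional argument of cross_val_score). It does not matter whether TimeSeriesSplit sees the scaled or the original data, because the splits depend only on the number of rows. What matters is that cross_val_score receives the scaled data.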
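To make the point above concrete, here is a minimal sketch on synthetic data. The array shapes, the coefficients and n_splits=5 are assumptions chosen for illustration and are not taken from the question. It prints the expanding folds, checks that rescaling the input does not change the split indices, and compares the two ways of passing cv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic stand-in for the scaled training data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

time_split = TimeSeriesSplit(n_splits=5)

# Each fold trains on an expanding prefix and tests on the block right after it.
folds = list(time_split.split(X))
for train_idx, test_idx in folds:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")

# The indices depend only on the number of rows, not on the values,
# so splitting a rescaled copy of X gives identical folds.
folds_rescaled = list(time_split.split(X * 100.0))
same_indices = all(
    np.array_equal(a[0], b[0]) and np.array_equal(a[1], b[1])
    for a, b in zip(folds, folds_rescaled)
)

# Passing the splitter object or the generator from split() gives the same scores.
scores_obj = cross_val_score(LinearRegression(), X, y, cv=time_split, scoring="r2")
scores_gen = cross_val_score(LinearRegression(), X, y, cv=time_split.split(X), scoring="r2")
print(same_indices, np.allclose(scores_obj, scores_gen))
```

Passing the splitter object itself is usually the more convenient form, since the generator returned by split() is consumed after one use.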

I also made a few small simplifications to your code: scaling before train_test_split, and making the output data a Series (so you do not need values.ravel):

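import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score, train_test_split
from sklearn.preprocessing import MinMaxScaler


def fit_model1(data: pd.DataFrame):
    # test_sizes is assumed to be defined elsewhere, as in the question.
    df = data
    scores_fit_model1 = []
    for sizes in test_sizes:
        # Generate test design
        input_data = df.drop('next_count', axis=1)
        output_data = df['next_count']  # a Series, so no values.ravel() is needed

        # Scale all rows once, before splitting
        scaler = MinMaxScaler()
        scaled_input = scaler.fit_transform(input_data)
        X_train, X_test, y_train, y_test = train_test_split(scaled_input, output_data, test_size=sizes, random_state=0, shuffle=False)

        # Build model
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        predictions = lr.predict(X_test)

        # Cross-validation definition
        time_split = TimeSeriesSplit(n_splits=10)

        # Performance metrics: the splitter object is passed to cv directly
        r2 = cross_val_score(lr, X_train, y_train, cv=time_split, scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)

    return scores_fit_model1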
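One thing to keep in mind about scaling before train_test_split: the scaler is then fit on every row, so the held-out rows influence the minimum and maximum used for scaling. If that matters for your evaluation, a possible alternative is to put the scaler and the model in a Pipeline, so that cross_val_score refits the scaler on the training part of each fold. The following is only a sketch on synthetic data. The array sizes and coefficients are assumptions for illustration, and X_train and y_train here stand in for the unscaled training split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, unscaled stand-in for the training split (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 50, size=(120, 4))
y_train = X_train @ np.array([0.3, -1.0, 2.0, 0.0]) + rng.normal(size=120)

# The scaler is part of the estimator, so it is refit inside every fold
# using only that fold's training rows.
model = make_pipeline(MinMaxScaler(), LinearRegression())

time_split = TimeSeriesSplit(n_splits=10)
scores = cross_val_score(model, X_train, y_train, cv=time_split, scoring="r2")
mean_r2 = scores.mean()
print(mean_r2)
```

With this layout the data passed to cross_val_score is the raw training data, and the pipeline takes care of scaling within each fold.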