如何对 pandas 数据框运行多重共线性测试?

Aak*_*asu 6 pandas statsmodels python-3.6

我对 Python、统计和使用 DS 库比较陌生,我的要求是对具有 n 列的数据集运行多重共线性测试,并确保 VIF > 5 的列/变量被完全删除。

我找到了一个代码,

 from statsmodels.stats.outliers_influence import variance_inflation_factor

    def calculate_vif_(X, thresh=5.0):

        variables = range(X.shape[1])
        tmp = range(X[variables].shape[1])
        print(tmp)
        dropped=True
        while dropped:
            dropped=False
            vif = [variance_inflation_factor(X[variables].values, ix) for ix in range(X[variables].shape[1])]

            maxloc = vif.index(max(vif))
            if max(vif) > thresh:
                print('dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
                del variables[maxloc]
                dropped=True

        print('Remaining variables:')
        print(X.columns[variables])
        return X[variables]
Run Code Online (Sandbox Code Playgroud)

但是,我不清楚,我应该将数据集完全传递到 X 参数的位置吗?如果是,则不起作用。

请帮忙!

小智 6

我也遇到了运行类似问题的问题。我通过更改variables定义方式并找到另一种删除其元素的方法来修复它。

以下脚本应适用于 Anaconda 5.0.1 和 Python 3.6(撰写本文时的最新版本)。

import numpy as np
import pandas as pd
import time
from statsmodels.stats.outliers_influence import variance_inflation_factor    
from joblib import Parallel, delayed

# Defining the function that you will run later
def calculate_vif_(X, thresh=5.0):
    variables = [X.columns[i] for i in range(X.shape[1])]
    dropped=True
    while dropped:
        dropped=False
        print(len(variables))
        vif = Parallel(n_jobs=-1,verbose=5)(delayed(variance_inflation_factor)(X[variables].values, ix) for ix in range(len(variables)))

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print(time.ctime() + ' dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            variables.pop(maxloc)
            dropped=True

    print('Remaining variables:')
    print([variables])
    return X[[i for i in variables]]

X = df[feature_list] # Selecting your data

X2 = calculate_vif_(X,5) # Actually running the function
Run Code Online (Sandbox Code Playgroud)

如果你有很多功能,那么运行起来会花费很长时间。因此,我做了另一项更改,让它并行工作,以防您有多个可用的 CPU。

享受!


Aak*_*asu 3

我对代码进行了调整,并通过以下代码设法实现了所需的结果,并进行了一些异常处理 -

def multicollinearity_check(X, thresh=5.0):
    data_type = X.dtypes
    # print(type(data_type))
    int_cols = \
    X.select_dtypes(include=['int', 'int16', 'int32', 'int64', 'float', 'float16', 'float32', 'float64']).shape[1]
    total_cols = X.shape[1]
    try:
        if int_cols != total_cols:
            raise Exception('All the columns should be integer or float, for multicollinearity test.')
        else:
            variables = list(range(X.shape[1]))
            dropped = True
            print('''\n\nThe VIF calculator will now iterate through the features and calculate their respective values.
            It shall continue dropping the highest VIF features until all the features have VIF less than the threshold of 5.\n\n''')
            while dropped:
                dropped = False
                vif = [variance_inflation_factor(X.iloc[:, variables].values, ix) for ix in variables]
                print('\n\nvif is: ', vif)
                maxloc = vif.index(max(vif))
                if max(vif) > thresh:
                    print('dropping \'' + X.iloc[:, variables].columns[maxloc] + '\' at index: ' + str(maxloc))
                    # del variables[maxloc]
                    X.drop(X.columns[variables[maxloc]], 1, inplace=True)
                    variables = list(range(X.shape[1]))
                    dropped = True

            print('\n\nRemaining variables:\n')
            print(X.columns[variables])
            # return X.iloc[:,variables]
            return X
    except Exception as e:
        print('Error caught: ', e)
Run Code Online (Sandbox Code Playgroud)