Tha*_*ude 2 python correlation dataframe pearson-correlation
I have a DataFrame like this:
import numpy as np
import pandas as pd
dict_ = {'Date': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05'], 'Col1': [1, 2, 3, 4, 5], 'Col2': [1.1, 1.2, 1.3, 1.4, 1.5], 'Col3': [0.33, 0.98, 1.54, 0.01, 0.99]}
df = pd.DataFrame(dict_, columns=dict_.keys())
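Then I compute the Pearson correlation between the columns and filter out every column whose correlation with another column exceeds my threshold of 0.95: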
def trimm_correlated(df_in, threshold):
    df_corr = df_in.corr(method='pearson', min_periods=1)
    # Mask the diagonal (self-correlation), then flag columns with any |r| above the threshold
    df_not_correlated = ~(df_corr.mask(np.eye(len(df_corr), dtype=bool)).abs() > threshold).any()
    un_corr_idx = df_not_correlated.loc[df_not_correlated[df_not_correlated.index] == True].index
    df_out = df_in[un_corr_idx]
    return df_out
which produces:
uncorrelated_factors = trimm_correlated(df, 0.95)
print(uncorrelated_factors)
   Col3
0  0.33
1  0.98
2  1.54
3  0.01
4  0.99
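So far I am happy with the result, but I would like to keep one column from each correlated pair. In the example above, I would like to include either Col1 or Col2, to get something like this: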
   Col1  Col3
0     1  0.33
1     2  0.98
2     3  1.54
3     4  0.01
4     5  0.99
Also, is there any further evaluation I could do to decide which of the correlated columns to keep?
Thanks
You can use np.tril() instead of np.eye() for the mask:
def trimm_correlated(df_in, threshold):
    df_corr = df_in.corr(method='pearson', min_periods=1)
    # Mask the diagonal and the lower triangle so each pair is only checked once
    df_not_correlated = ~(df_corr.mask(np.tril(np.ones([len(df_corr)]*2, dtype=bool))).abs() > threshold).any()
    un_corr_idx = df_not_correlated.loc[df_not_correlated[df_not_correlated.index] == True].index
    df_out = df_in[un_corr_idx]
    return df_out
Output:
   Col1  Col3
0     1  0.33
1     2  0.98
2     3  1.54
3     4  0.01
4     5  0.99
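To see why the np.tril() mask keeps one column per correlated pair, it may help to spell the steps out. The following is only a minimal sketch of the same idea, not a replacement for the function above. The names `numeric`, `upper_only`, `to_drop` and `kept` are illustrative. Selecting numeric columns with `select_dtypes` is an assumption added here so that the string `Date` column stays out of the correlation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05'],
    'Col1': [1, 2, 3, 4, 5],
    'Col2': [1.1, 1.2, 1.3, 1.4, 1.5],
    'Col3': [0.33, 0.98, 1.54, 0.01, 0.99],
})

# Assumption: only numeric columns take part in the correlation.
numeric = df.select_dtypes(include='number')
corr = numeric.corr(method='pearson').abs()

# True on and below the diagonal; mask() turns those entries into NaN,
# so only the strict upper triangle of the matrix survives.
lower = np.tril(np.ones(corr.shape, dtype=bool))
upper_only = corr.mask(lower)

# A column is dropped only if it is highly correlated with a column to its left.
# Comparisons against NaN are False, so the first column is never dropped.
to_drop = [col for col in upper_only.columns if (upper_only[col] > 0.95).any()]
kept = numeric.drop(columns=to_drop)

print(to_drop)
print(list(kept.columns))
```

Because each column is only compared with the columns to its left, the leftmost column of a correlated pair is the one that survives. Reordering the columns before the call is therefore one simple way to control which of the correlated columns is kept.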