大熊猫栏目相关性具有统计学意义

Question

大熊猫栏目相关性具有统计学意义

wol*_*oor 17 python scipy correlation pandas

给出一个pandas数据帧df,以获得其列df.1与之间的相关性,最好的方法是什么df.2？

我不希望输出计数行NaN,pandas内置的相关性.但我也希望它输出一个pvalue或一个标准错误,内置的错误.

SciPy 似乎被NaN追上了,尽管我相信它确实具有重要意义.

数据示例:

     1           2
0    2          NaN
1    NaN         1
2    1           2
3    -4          3
4    1.3         1
5    NaN         NaN

Run Code Online (Sandbox Code Playgroud)

Answer 1

Sha*_*wal 23

您可以使用scipy.stats关联函数来获取p值.

例如,如果要查找pearson相关性等相关性,可以使用pearsonr函数.

from scipy.stats import pearsonr
pearsonr([1, 2, 3], [4, 3, 7])

Run Code Online (Sandbox Code Playgroud)

给出输出

(0.7205766921228921, 0.48775429164459994)

Run Code Online (Sandbox Code Playgroud)

其中元组中的第一个值是相关值,第二个值是p值.

在您的情况下,您可以使用pandas的dropna函数首先删除NaN值.

df_clean = df[['column1', 'column2']].dropna()
pearsonr(df_clean['column1'], df_clean['column2'])

Run Code Online (Sandbox Code Playgroud)

Answer 2

tot*_*ico 15

要一次计算所有p值,您可以使用以下calculate_pvalues函数:

df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1], 'D':['text',2,3] })
calculate_pvalues(df)

Run Code Online (Sandbox Code Playgroud)

输出类似于corr()(但有p值):

        A       B       C
A       0  0.7877  0.1789
B  0.7877       0  0.6088
C  0.1789  0.6088       0

Run Code Online (Sandbox Code Playgroud)

p值四舍五入为4位小数
列D被忽略,因为它包含文本.

以下是该函数的代码:

from scipy.stats import pearsonr
import pandas as pd

def calculate_pvalues(df):
    df = df.dropna()._get_numeric_data()
    dfcols = pd.DataFrame(columns=df.columns)
    pvalues = dfcols.transpose().join(dfcols, how='outer')
    for r in df.columns:
        for c in df.columns:
            pvalues[r][c] = round(pearsonr(df[r], df[c])[1], 4)
    return pvalues

Run Code Online (Sandbox Code Playgroud)

Answer 3

BKa*_*Kay 11

@Shashank提供的答案很好.但是,如果您想要纯粹的解决方案pandas,您可能会喜欢这样:

import pandas as pd
from pandas.io.data import DataReader
from datetime import datetime
import scipy.stats  as stats


gdp = pd.DataFrame(DataReader("GDP", "fred", start=datetime(1990, 1, 1)))
vix = pd.DataFrame(DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1)))

#Do it with a pandas regression to get the p value from the F-test
df = gdp.merge(vix,left_index=True, right_index=True, how='left')
vix_on_gdp = pd.ols(y=df['VIXCLS'], x=df['GDP'], intercept=True)
print(df['VIXCLS'].corr(df['GDP']), vix_on_gdp.f_stat['p-value'])

Run Code Online (Sandbox Code Playgroud)

结果:

-0.0422917932738 0.851762475093

Run Code Online (Sandbox Code Playgroud)

与stats功能相同的结果:

#Do it with stats functions. 
df_clean = df.dropna()
stats.pearsonr(df_clean['VIXCLS'], df_clean['GDP'])

Run Code Online (Sandbox Code Playgroud)

结果:

  (-0.042291793273791969, 0.85176247509284908)

Run Code Online (Sandbox Code Playgroud)

为了扩展到更多的可用性,我给你一个丑陋的循环方法:

#Add a third field
oil = pd.DataFrame(DataReader("DCOILWTICO", "fred", start=datetime(1990, 1, 1))) 
df = df.merge(oil,left_index=True, right_index=True, how='left')

#construct two arrays, one of the correlation and the other of the p-vals
rho = df.corr()
pval = np.zeros([df.shape[1],df.shape[1]])
for i in range(df.shape[1]): # rows are the number of rows in the matrix.
    for j in range(df.shape[1]):
        JonI        = pd.ols(y=df.icol(i), x=df.icol(j), intercept=True)
        pval[i,j]  = JonI.f_stat['p-value']

Run Code Online (Sandbox Code Playgroud)

rho的结果:

             GDP    VIXCLS  DCOILWTICO
 GDP         1.000000 -0.042292    0.870251
 VIXCLS     -0.042292  1.000000   -0.004612
 DCOILWTICO  0.870251 -0.004612    1.000000

Run Code Online (Sandbox Code Playgroud)

pval的结果:

 [[  0.00000000e+00   8.51762475e-01   1.11022302e-16]
  [  8.51762475e-01   0.00000000e+00   9.83747425e-01]
  [  1.11022302e-16   9.83747425e-01   0.00000000e+00]]

Run Code Online (Sandbox Code Playgroud)

Answer 4

Fab*_*ost 10

在 pandas v0.24.0 中，一个method参数被添加到corr. 现在，你可以这样做：

import pandas as pd
import numpy as np
from scipy.stats import pearsonr

df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1]})

df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))

Run Code Online (Sandbox Code Playgroud)

          A         B         C
A  0.000000  0.787704  0.178912
B  0.787704  0.000000  0.608792
C  0.178912  0.608792  0.000000

Run Code Online (Sandbox Code Playgroud)

请注意所需的解决方法np.eye(len(df.columns))，因为自相关始终设置为1.0（请参阅https://github.com/pandas-dev/pandas/issues/25726）。

我刚刚更新了答案。现在清楚了吗？ (2认同)

Answer 5

toz*_*CSS 8

rho = df.corr()
rho = rho.round(2)
pval = calculate_pvalues(df) # toto_tico's answer
# create three masks
r1 = rho.applymap(lambda x: '{}*'.format(x))
r2 = rho.applymap(lambda x: '{}**'.format(x))
r3 = rho.applymap(lambda x: '{}***'.format(x))
# apply them where appropriate
rho = rho.mask(pval<=0.1,r1)
rho = rho.mask(pval<=0.05,r2)
rho = rho.mask(pval<=0.01,r3)
rho
# note I prefer readability over the conciseness of code, 
# instead of six lines it could have been a single liner like this:
# [rho.mask(pval<=p,rho.applymap(lambda x: '{}*'.format(x)),inplace=True) for p in [.1,.05,.01]]

Run Code Online (Sandbox Code Playgroud)

这非常适合实施。它是否有可能以 np.triu 作为掩码进入 sns.heatmap 中？ (5认同)

归档时间：	11 年，3 月前
查看次数：	37382 次
最近记录：	6 年，6 月前