Calculating Pearson correlation and significance in Python

ari*_*iel 185 python statistics numpy scipy

I am looking for a function that takes two lists as input and returns the Pearson correlation and the significance of the correlation.

Sac*_*cha 196

You can have a look at scipy.stats:

from pydoc import help
from scipy.stats import pearsonr
help(pearsonr)

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation

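  • What about the correlation coefficient of two dictionaries?! (2 upvotes)
  • @user702846 Pearson correlation is defined on a 2xN matrix. There is no generally applicable way to convert two dictionaries into a 2xN matrix, but you could use the array of pairs of dictionary values that correspond to the intersection of the two dictionaries' keys. (2 upvotes)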
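
The key-intersection idea from the comment above could be sketched as follows. The helper name `pearson_for_dicts` and the sample dictionaries are made up for illustration; the only library call is `scipy.stats.pearsonr`, and the sketch assumes the two dictionaries share at least three keys.

```python
from scipy.stats import pearsonr

def pearson_for_dicts(d1, d2):
    """Correlate the values of two dicts over the keys they share (hypothetical helper)."""
    # Sort the shared keys so both value lists are paired in the same order.
    shared = sorted(set(d1) & set(d2))
    xs = [d1[k] for k in shared]
    ys = [d2[k] for k in shared]
    return pearsonr(xs, ys)

# Illustrative data only: keys "b", "c" and "d" are common to both dicts.
prices = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
sales = {"b": 4.1, "c": 5.9, "d": 8.2, "e": 10.0}

r, p = pearson_for_dicts(prices, sales)
print(r, p)
```

Keys that appear in only one dictionary are silently dropped here; whether that is appropriate depends on the data.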

win*_*erd 106

The Pearson correlation can be calculated with numpy's corrcoef.

import numpy
numpy.corrcoef(list1, list2)[0, 1]

  • This does not give the significance of the correlation that was asked for, right? (3 upvotes)
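
If only `numpy.corrcoef` is used for the coefficient, a two-sided p-value can still be derived from it with the standard t-test for a correlation coefficient. The sketch below is one way to do that, not part of the answer above. The helper name `corrcoef_with_pvalue` is an assumption, it uses `scipy.stats.t` for the distribution, and it assumes more than two points and |r| < 1.

```python
import numpy as np
from scipy import stats

def corrcoef_with_pvalue(x, y):
    """Return (r, two-sided p-value); hypothetical helper, assumes n > 2 and |r| < 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    r = np.corrcoef(x, y)[0, 1]
    # Under the null hypothesis of no correlation, t follows a
    # Student's t distribution with n - 2 degrees of freedom.
    t = r * np.sqrt((n - 2) / (1.0 - r * r))
    p = 2.0 * stats.t.sf(abs(t), df=n - 2)
    return r, p

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
r, p = corrcoef_with_pvalue(a, b)
print(r, p)
```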

Sal*_*ali 51

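Another option is the native scipy function linregress, which calculates:

slope: slope of the regression line

intercept: intercept of the regression line

r-value: correlation coefficient

p-value: two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero

stderr: standard error of the estimate

Here is an example: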

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)

This returns:

LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)

  • Great answer, the most informative so far. It also works with a two-row pandas.DataFrame: `linregress(two_row_df)` (2 upvotes)
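
As a quick sanity check, the `rvalue` and `pvalue` fields of the `linregress` result can be compared with what `pearsonr` reports for the same data. This is a small sketch using the lists from the example above; for simple linear regression the two should agree up to floating-point error.

```python
from scipy.stats import linregress, pearsonr

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]

fit = linregress(a, b)   # named fields: slope, intercept, rvalue, pvalue, stderr
r, p = pearsonr(a, b)    # correlation coefficient and two-sided p-value

print(fit.rvalue, fit.pvalue)
print(r, p)
```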

Jef*_*her 35

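If you don't feel like installing scipy, I've used this quick hack, slightly modified from Programming Collective Intelligence:

(Edited for correctness.)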

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(map(lambda x: pow(x, 2), x))
  sum_y_sq = sum(map(lambda x: pow(x, 2), y))
  # itertools.imap no longer exists in Python 3; the built-in map is already lazy
  psum = sum(map(lambda x, y: x * y, x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den

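  • Just as a comment, consider that libraries like scipy et al. are developed by people who know a lot of numerical analysis. This can help you avoid many common pitfalls (for example, having very large and very small numbers in X or Y may result in catastrophic cancellation). (10 upvotes)
  • As a matter of style, Python frowns on this unnecessary use of map (in favor of list comprehensions). (3 upvotes)
  • I was surprised to find that this disagrees with Excel, NumPy, and R. See http://stackoverflow.com/questions/3949226/calculating-pearson-correlation-and-significance-in-python/7939259#7939259. (2 upvotes)
  • As another commenter pointed out, this has a float/int bug. I think sum_y/n is integer division for integers. If you use sum_x = float(sum(x)) and sum_y = float(sum(y)), it works. (2 upvotes)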

dfr*_*kow 31

The following code is a straightforward interpretation of the definition:

import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    return diffprod / math.sqrt(xdiff2 * ydiff2)

Test:

print(pearson_def([1,2,3], [1,5,7]))

returns

0.981980506062

This agrees with Excel, this calculator, and SciPy (also NumPy), which return 0.981980506, 0.9819805060619657, and 0.98198050606196574, respectively.

R:

> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805

EDIT: Fixed a bug pointed out by a commenter.

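  • Careful with the types of your variables! You have run into an int/float problem. In `sum(x) / len(x)` you divide ints, not floats. So `sum([1,5,7]) / len([1,5,7]) = 13 / 3 = 4` under integer division (whereas you want `13. / 3. = 4.33...`). To fix it, rewrite this line as `float(sum(x)) / float(len(x))` (one float is enough, since Python converts the other automatically). (4 upvotes)
  • The correlation coefficient is not defined for any of these cases. Putting them into R returns "NA" for all three. (4 upvotes)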
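
One situation in which the coefficient is undefined is a constant input, because its variance is zero and the denominator vanishes. A minimal sketch of a variant that reports NaN in that case, loosely mirroring R's "NA", is shown below. The function name `pearson_or_nan` is made up for illustration, and returning NaN rather than raising is a design assumption.

```python
import math

def pearson_or_nan(x, y):
    """Pearson r from the definition; NaN when either input has zero variance."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    avg_x = sum(x) / n
    avg_y = sum(y) / n
    diffprod = sum((a - avg_x) * (b - avg_y) for a, b in zip(x, y))
    xdiff2 = sum((a - avg_x) ** 2 for a in x)
    ydiff2 = sum((b - avg_y) ** 2 for b in y)
    denom = math.sqrt(xdiff2 * ydiff2)
    if denom == 0:
        # Undefined: at least one of the inputs is constant.
        return float("nan")
    return diffprod / denom

print(pearson_or_nan([1, 2, 3], [1, 5, 7]))
print(pearson_or_nan([1, 1, 1], [1, 2, 3]))
```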

Mar*_*oma 25

You can also do this with pandas.DataFrame.corr:

import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()

This gives

          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000

  • This is only the correlation, without the significance. (2 upvotes)
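
One possible way to get p-values alongside the pandas correlation matrix is to pass a callable as the `method` argument of `DataFrame.corr` and have it return the p-value from `scipy.stats.pearsonr`. This is a sketch under that assumption, not part of the answer above. Note that pandas fills the diagonal of the result with 1.0 whatever the callable returns, so only the off-diagonal entries are meaningful.

```python
import pandas as pd
from scipy.stats import pearsonr

a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)

# Pairwise Pearson coefficients between columns.
r_matrix = df.corr()

# Pairwise two-sided p-values: the callable receives two 1-D arrays.
p_matrix = df.corr(method=lambda x, y: pearsonr(x, y)[1])

print(r_matrix)
print(p_matrix)
```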

com*_*ski 12

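I think my answer should be the easiest to code and the easiest way to understand the steps for calculating the Pearson Correlation Coefficient (PCC), rather than relying on numpy/scipy.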

import math

# calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
         sum += i
    return sum / len(x) 

# calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
         sumv += (i - mean(x))**2
    return math.sqrt(sumv/(len(x)-1))

# calculates the PCC using both the 2 functions above
def pearson(x,y):
    scorex = []
    scorey = []

    for i in x: 
        scorex.append((i - mean(x))/sampleStandardDeviation(x)) 

    for j in y:
        scorey.append((j - mean(y))/sampleStandardDeviation(y))

# multiplies both lists together into 1 list (hence zip) and sums the whole list   
    return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)

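The significance of the PCC is basically to show you how strongly correlated the two variables/lists are. Note that the PCC value ranges from -1 to 1. A value between 0 and 1 denotes a positive correlation. A value of 0 means the highest variation (no correlation whatsoever). A value between -1 and 0 denotes a negative correlation.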

  • It has staggering complexity and slow performance on 2 lists with 500+ values. (4 upvotes)
  • Note that Python has a built-in `sum` function. (2 upvotes)
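
The slowness mentioned in the comments comes from recomputing the mean and the standard deviation inside the loops. A linear-time sketch of the same z-score approach, still without numpy/scipy, could look like the following. The name `pearson_linear` is made up for illustration, and the inputs are assumed to have equal length, at least two elements, and non-zero variance.

```python
import math

def pearson_linear(x, y):
    """PCC via sample standard deviations, computing each statistic only once."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (n - 1 in the denominator).
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    # Sample covariance, again with n - 1.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return cov / (sd_x * sd_y)

print(pearson_linear([1, 2, 3], [1, 5, 7]))        # positive correlation
print(pearson_linear([1, 2, 3, 4], [8, 6, 4, 2]))  # exact decreasing line, close to -1
```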

Mar*_*sen 7

Hmm, many of these responses have long and hard-to-read code...

When working with arrays, I'd suggest using numpy and its nifty features:

import numpy as np
def pcc(X, Y):
   ''' Compute Pearson Correlation Coefficient. '''
   # Normalise X and Y (on new arrays, so the caller's data is not modified)
   X = X - X.mean(0)
   Y = Y - Y.mean(0)
   # Standardise X and Y
   X = X / X.std(0)
   Y = Y / Y.std(0)
   # Compute mean product
   return np.mean(X*Y)

# Using it on a random example
from random import random
X = np.array([random() for x in range(100)])
Y = np.array([random() for x in range(100)])
pcc(X, Y)


Moj*_*adi 6

Here is an implementation of the Pearson correlation function using numpy:


def corr(data1, data2):
    "data1 & data2 should be numpy arrays."
    mean1 = data1.mean() 
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()

#     corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
    corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
    return corr
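
The commented-out line and the active line in the function above are two algebraically equivalent forms of the same quantity. A short check of the centered form against `numpy.corrcoef` is sketched below. The function name `corr_centered` and the random test data are made up for illustration.

```python
import numpy as np

def corr_centered(data1, data2):
    """Centered form: mean of the product of deviations over the product of stds."""
    d1 = data1 - data1.mean()
    d2 = data2 - data2.mean()
    # .std() uses the population formula (ddof=0), matching the .mean() above.
    return (d1 * d2).mean() / (data1.std() * data2.std())

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

print(corr_centered(x, y))
print(np.corrcoef(x, y)[0, 1])
```

The centered form subtracts the means before multiplying, which tends to be less sensitive to cancellation when the data have a large offset.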


Web*_*ter 6

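Pearson coefficient calculation using pandas in Python: since your data contains lists, I would suggest trying this approach. It is easy to interact with your data and manipulate it from the console, because you can visualise your data structure and update it as you wish. You can also export the data set, save it, and add new data from the Python console for later analysis. This code is simpler and contains fewer lines of code. I am assuming you need a few quick lines of code to screen your data for further analysis.

Example: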

data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}

import pandas as pd # Used to convert your lists into a pandas DataFrame

df = pd.DataFrame(data, columns = ['list 1','list 2'])

from scipy import stats # For in-built method to get PCC

pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results 

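However, you did not post your data, so I could not see the size of the data set or any transformations that might be required before the analysis.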


Tur*_*ute 5

Here is a variant of mkh's answer that, by using numba, runs much faster than it and than scipy.stats.pearsonr.

import numba

@numba.jit
def corr(data1, data2):
    M = data1.size

    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M

    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])

    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)