计算 xarray 中缺失数据的相关性

cli*_*ray 4 python numpy correlation python-xarray

我正在尝试计算 xarray 中两个数据集沿时间维度的相关性。我的数据集都是纬度 x 经度 x 时间。我的一个数据集有足够的数据丢失,这对于插值和消除间隙是不合理的,相反,我想忽略丢失的值。我有一些简单的代码可以在一定程度上发挥作用,但没有一个适合我的具体用例。例如:

def covariance(x,y,dims=None):
    return xr.dot(x-x.mean(dims), y-y.mean(dims), dims=dims) / x.count(dims)

def correlation(x,y,dims=None):
    return covariance(x,y,dims) / (x.std(dims) * y.std(dims))
Run Code Online (Sandbox Code Playgroud)

如果没有丢失数据,则效果很好,但当然不能与 nan 一起使用。虽然这里有一个为 xarray 编写的很好的例子,但即使使用这段代码,我仍在努力计算皮尔逊的相关性而不是斯皮尔曼的相关性。

import numpy as np
import xarray as xr
import bottleneck

def covariance_gufunc(x, y):
    return ((x - x.mean(axis=-1, keepdims=True))
            * (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)

def pearson_correlation_gufunc(x, y):
    return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))

def spearman_correlation_gufunc(x, y):
    x_ranks = bottleneck.rankdata(x, axis=-1)
    y_ranks = bottleneck.rankdata(y, axis=-1)
    return pearson_correlation_gufunc(x_ranks, y_ranks)

def spearman_correlation(x, y, dim):
    return xr.apply_ufunc(
        spearman_correlation_gufunc, x, y,
        input_core_dims=[[dim], [dim]],
        dask='parallelized',
        output_dtypes=[float])
Run Code Online (Sandbox Code Playgroud)

最后, github 上有一个有用的讨论,将其作为一个功能添加到 xarray 中,但尚未实现。有没有一种有效的方法可以在存在数据缺口的数据集上做到这一点?

And*_*ams 5

我一直在关注这个 Github 讨论以及随后实现方法的尝试.corr(),看起来我们已经非常接近了,但还没有实现。

与此同时,大多数人试图合并的基本代码在另一个答案中得到了很好的概述(How to apply Linear regression to everypixel in a large multi-Dimensional array contains NaNs?)。这是一个很好的解决方案,它利用 NumPy 中的矢量化运算,并进行一些小的调整(请参阅链接中接受的答案)来解释沿时间轴的 NaN。

def lag_linregress_3D(x, y, lagx=0, lagy=0):
"""
Input: Two xr.Datarrays of any dimensions with the first dim being time. 
Thus the input data could be a 1D time series, or for example, have three 
dimensions (time,lat,lon). 
Datasets can be provided in any order, but note that the regression slope 
and intercept will be calculated for y with respect to x.
Output: Covariance, correlation, regression slope and intercept, p-value, 
and standard error on regression between the two datasets along their 
aligned time dimension.  
Lag values can be assigned to either of the data, with lagx shifting x, and
lagy shifting y, with the specified lag amount. 
""" 
#1. Ensure that the data are properly alinged to each other. 
x,y = xr.align(x,y)

#2. Add lag information if any, and shift the data accordingly
if lagx!=0:

    # If x lags y by 1, x must be shifted 1 step backwards. 
    # But as the 'zero-th' value is nonexistant, xr assigns it as invalid 
    # (nan). Hence it needs to be dropped
    x   = x.shift(time = -lagx).dropna(dim='time')

    # Next important step is to re-align the two datasets so that y adjusts
    # to the changed coordinates of x
    x,y = xr.align(x,y)

if lagy!=0:
    y   = y.shift(time = -lagy).dropna(dim='time')
    x,y = xr.align(x,y)

#3. Compute data length, mean and standard deviation along time axis: 
n = y.notnull().sum(dim='time')
xmean = x.mean(axis=0)
ymean = y.mean(axis=0)
xstd  = x.std(axis=0)
ystd  = y.std(axis=0)

#4. Compute covariance along time axis
cov   =  np.sum((x - xmean)*(y - ymean), axis=0)/(n)

#5. Compute correlation along time axis
cor   = cov/(xstd*ystd)

#6. Compute regression slope and intercept:
slope     = cov/(xstd**2)
intercept = ymean - xmean*slope  

#7. Compute P-value and standard error
#Compute t-statistics
tstats = cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
stderr = slope/tstats

from scipy.stats import t
pval   = t.sf(tstats, n-2)*2
pval   = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

return cov,cor,slope,intercept,pval,stderr
Run Code Online (Sandbox Code Playgroud)

希望这可以帮助!祈祷合并很快就会到来。

  • `xr.cov` 和 `xr.corr` 的 PR 已在 https://github.com/pydata/xarray/pull/4089 合并,应该会在即将发布的 xarray v0.16.0 中发布。 (2认同)
  • `xr.corr` 和 `xr.cov` 现已发布!:) (2认同)