如何将线性回归应用于包含NaN的大型多维数组中的每个像素?

Rob*_*lor 9 python arrays numpy scipy python-xarray

我有一个独立变量值的一维数组(x_array),它与具有多个时间步长()的3D numpy空间数据数组中的时间步长相匹配y_array.我的实际数据要大得多:300+时间步长和高达3000*3000像素:

import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])
Run Code Online (Sandbox Code Playgroud)

我想计算每个像素的线性回归,将获得的R平方,P值,截距和斜率为每个xy像素y_array,为每个时间步长值x_array作为我的独立变量.

我可以重新整形以获取表单中的数据以将其输入到np.polyfit矢量化和快速:

# Reshape so rows = number of time-steps and columns = pixels:
y_array_reshaped = y_array.reshape(len(y_array), -1)

# Do a first-degree polyfit
np.polyfit(x_array, y_array_reshaped, 1)
Run Code Online (Sandbox Code Playgroud)

但是,这会忽略包含任何NaN值(np.polyfit不支持NaN值)的像素,并且不会计算我需要的统计数据(R平方,P值,截距和斜率).

这里答案用于scipy.stats import linregress计算我需要的统计数据,并建议NaN通过屏蔽这些NaN值来避免问题.但是,这个例子适用于两个1D数组,我无法弄清楚如何对我的情况应用类似的掩码方法,其中每个列y_array_reshaped将具有不同的NaN值集.

我的问题:如何NaN以合理快速的矢量化方式计算包含许多值的大型多维数组(300 x 3000 x 3000)中每个像素的回归统计量?

期望的结果:每个像素的3 x 3回归统计值数组(例如R平方)y_array,即使该像素包含NaN时间序列中某个点的值

Rob*_*lor 6

上面评论中提到的这篇博文包含一个非常快的矢量化函数,用于 Python 中多维数据的互相关、协方差和回归。它生成我需要的所有回归输出,并且在几毫秒内完成,因为它完全依赖于xarray.

https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html

我做了一个小改动(在 之后的第一行#3)以确保该函数正确地解释了每个像素中不同数量的 NaN 值:

def lag_linregress_3D(x, y, lagx=0, lagy=0):
"""
Input: Two xr.Datarrays of any dimensions with the first dim being time. 
Thus the input data could be a 1D time series, or for example, have three 
dimensions (time,lat,lon). 
Datasets can be provided in any order, but note that the regression slope 
and intercept will be calculated for y with respect to x.
Output: Covariance, correlation, regression slope and intercept, p-value, 
and standard error on regression between the two datasets along their 
aligned time dimension.  
Lag values can be assigned to either of the data, with lagx shifting x, and
lagy shifting y, with the specified lag amount. 
""" 
#1. Ensure that the data are properly alinged to each other. 
x,y = xr.align(x,y)

#2. Add lag information if any, and shift the data accordingly
if lagx!=0:

    # If x lags y by 1, x must be shifted 1 step backwards. 
    # But as the 'zero-th' value is nonexistant, xr assigns it as invalid 
    # (nan). Hence it needs to be dropped
    x   = x.shift(time = -lagx).dropna(dim='time')

    # Next important step is to re-align the two datasets so that y adjusts
    # to the changed coordinates of x
    x,y = xr.align(x,y)

if lagy!=0:
    y   = y.shift(time = -lagy).dropna(dim='time')
    x,y = xr.align(x,y)

#3. Compute data length, mean and standard deviation along time axis: 
n = y.notnull().sum(dim='time')
xmean = x.mean(axis=0)
ymean = y.mean(axis=0)
xstd  = x.std(axis=0)
ystd  = y.std(axis=0)

#4. Compute covariance along time axis
cov   =  np.sum((x - xmean)*(y - ymean), axis=0)/(n)

#5. Compute correlation along time axis
cor   = cov/(xstd*ystd)

#6. Compute regression slope and intercept:
slope     = cov/(xstd**2)
intercept = ymean - xmean*slope  

#7. Compute P-value and standard error
#Compute t-statistics
tstats = cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
stderr = slope/tstats

from scipy.stats import t
pval   = t.sf(tstats, n-2)*2
pval   = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

return cov,cor,slope,intercept,pval,stderr
Run Code Online (Sandbox Code Playgroud)


Fei*_*Yao 6

这里提供的答案https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html绝对是好的,因为它主要利用了numpy矢量化和广播的强大功能,但它假设要分析数据是完整的,但在实际研究周期中通常情况并非如此。上面的一个答案旨在解决丢失数据的问题,但我个人认为需要更新更多代码,因为np.mean()如果数据中存在 nan,则会返回 nan。幸运的是,为我们numpy提供了nanmean()nanstd()等,用于通过忽略数据中的 nan 来计算平均值、标准误差等。同时,原始博客中的程序针对的是 netCDF 格式的数据。有些人可能不知道这一点,但更熟悉原始numpy.array格式。因此,我在这里提供一个代码示例,展示如何计算两个 3D 维数组(nD 维具有相同逻辑)之间的协方差、相关系数等。请注意,为了方便起见,我将x_array设为 的第一维度的索引,但在实际分析中肯定可以从外部读取。y_arrayx_array

代码

def linregress_3D(y_array):
    # y_array is a 3-D array formatted like (time,lon,lat)
    # The purpose of this function is to do linear regression using time series of data over each (lon,lat) grid box with consideration of ignoring np.nan
    # Construct x_array indicating time indexes of y_array, namely the independent variable.
    x_array=np.empty(y_array.shape)
    for i in range(y_array.shape[0]): x_array[i,:,:]=i+1 # This would be fine if time series is not too long. Or we can use i+yr (e.g. 2019).
    x_array[np.isnan(y_array)]=np.nan
    # Compute the number of non-nan over each (lon,lat) grid box.
    n=np.sum(~np.isnan(x_array),axis=0)
    # Compute mean and standard deviation of time series of x_array and y_array over each (lon,lat) grid box.
    x_mean=np.nanmean(x_array,axis=0)
    y_mean=np.nanmean(y_array,axis=0)
    x_std=np.nanstd(x_array,axis=0)
    y_std=np.nanstd(y_array,axis=0)
    # Compute co-variance between time series of x_array and y_array over each (lon,lat) grid box.
    cov=np.nansum((x_array-x_mean)*(y_array-y_mean),axis=0)/n
    # Compute correlation coefficients between time series of x_array and y_array over each (lon,lat) grid box.
    cor=cov/(x_std*y_std)
    # Compute slope between time series of x_array and y_array over each (lon,lat) grid box.
    slope=cov/(x_std**2)
    # Compute intercept between time series of x_array and y_array over each (lon,lat) grid box.
    intercept=y_mean-x_mean*slope
    # Compute tstats, stderr, and p_val between time series of x_array and y_array over each (lon,lat) grid box.
    tstats=cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
    stderr=slope/tstats
    from scipy.stats import t
    p_val=t.sf(tstats,n-2)*2
    # Compute r_square and rmse between time series of x_array and y_array over each (lon,lat) grid box.
    # r_square also equals to cor**2 in 1-variable lineare regression analysis, which can be used for checking.
    r_square=np.nansum((slope*x_array+intercept-y_mean)**2,axis=0)/np.nansum((y_array-y_mean)**2,axis=0)
    rmse=np.sqrt(np.nansum((y_array-slope*x_array-intercept)**2,axis=0)/n)
    # Do further filteration if needed (e.g. We stipulate at least 3 data records are needed to do regression analysis) and return values
    n=n*1.0 # convert n from integer to float to enable later use of np.nan
    n[n<3]=np.nan
    slope[np.isnan(n)]=np.nan
    intercept[np.isnan(n)]=np.nan
    p_val[np.isnan(n)]=np.nan
    r_square[np.isnan(n)]=np.nan
    rmse[np.isnan(n)]=np.nan
    return n,slope,intercept,p_val,r_square,rmse
Run Code Online (Sandbox Code Playgroud)

样本输出

我使用这个程序测试了两个 227x3601x6301 像素的 3D 阵列,它在 20 分钟内完成了工作,每个阵列不到 10 分钟。


cos*_*iry 5

我不确定这将如何扩展(也许您可以使用daskDataFrame ),但这里有一个非常简单的方法,可以使用apply方法使用pandas 来实现此目的:

import pandas as pd
import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

def lin_regress(col):
    "Mask nulls and apply stats.linregress"
    col = col.loc[~pd.isnull(col)]
    return linregress(col.index.tolist(), col)

# Build the DataFrame (each index represents a pixel)
df = pd.DataFrame(y_array.reshape(len(y_array), -1), index=x_array.tolist())

# Apply a our custom linregress wrapper to each function, split the tuple into separate columns
final_df = df.apply(lin_regress).apply(pd.Series)

# Name the index and columns to make this easier to read
final_df.columns, final_df.index.name = 'slope, intercept, r_value, p_value, std_err'.split(', '), 'pixel_number'

print(final_df)
Run Code Online (Sandbox Code Playgroud)

输出:

                 slope  intercept   r_value       p_value   std_err
pixel_number                                                       
0             0.071429  -0.192857  0.188982  8.789623e-01  0.371154
1            -0.071429  -0.207143 -0.188982  8.789623e-01  0.371154
2             0.357143  -0.464286  0.944911  2.122956e-01  0.123718
3             0.105263  -0.289474  0.229416  7.705843e-01  0.315789
4             1.000000  -0.700000  1.000000  9.003163e-11  0.000000
5            -0.285714  -0.328571 -0.188982  8.789623e-01  1.484615
6             0.105263  -0.289474  0.132453  8.675468e-01  0.557000
7            -0.285714  -0.228571 -0.755929  4.543711e-01  0.247436
8             0.071429  -0.392857  0.188982  8.789623e-01  0.371154
Run Code Online (Sandbox Code Playgroud)