Scipy Gaussian KDE：矩阵不是正定的

Question

Scipy Gaussian KDE：矩阵不是正定的

我试图使用 scipy 来估计数据集在某些点的密度。

from scipy.stats import gaussian_kde
import numpy as np

Run Code Online (Sandbox Code Playgroud)

我有一个A3D 点数据集（这只是一个最小的例子。我的实际数据有更多维度和更多样本）

A = np.array([[0.078377  , 0.76737392, 0.45038174],
       [0.65990129, 0.13154658, 0.30770917],
       [0.46068406, 0.22751313, 0.28122463]])

Run Code Online (Sandbox Code Playgroud)

以及我想要估计密度的点

B = np.array([[0.40209377, 0.21063273, 0.75885516],
       [0.91709997, 0.79303252, 0.65156937]])

Run Code Online (Sandbox Code Playgroud)

但我似乎无法使用该gaussian_kde功能，因为

result = gaussian_kde(A.T)(B.T)

Run Code Online (Sandbox Code Playgroud)

回报

LinAlgError: Matrix is not positive definite

Run Code Online (Sandbox Code Playgroud)

我该如何修复这个错误？如何获得样品的密度？

Answer 1

fr_*_*lio 5

TL; 博士：

您的数据中具有高度相关的特征，这会导致数值错误。有几种可能的方法可以解决这个问题，每种方法都有优点和缺点。下面提出了一个直接替换类gaussian_kde。

诊断

您的（即您在创建对象时而不是dataset使用对象时提供的矩阵）可能包含高度依赖的特征。这一事实（可能与低数值分辨率和“太多”数据点相结合）导致协方差矩阵具有接近零的特征值，由于数值误差可能会低于零。这反过来会破坏在数据集协方差矩阵上使用 Cholesky 分解的代码（具体细节请参阅解释）。gaussian_kdefloat32

假设您的数据集具有形状，(dims, N)您可以通过测试这是否是您的问题np.linalg.eigh(np.cov(dataset))[0] <= 0。如果有任何结果出来True，请让我第一个欢迎您加入俱乐部。

治疗方法

这个想法是让所有特征值都大于零。

将数值分辨率提高到实用的最高浮点数可能是一个简单的解决方案，值得尝试，但可能还不够。

考虑到这是由相关特征引起的，删除数据点并没有多大帮助。通过减少需要粉碎的数字来传播更少的误差并保持特征值大于零的希望渺茫。它很容易实现，但它会丢弃数据点。

更复杂的修复方法是识别高度相关的特征并将它们合并或忽略“多余”的特征。这是很棘手的，尤其是当维度之间的相关性因实例而异时。

也许最干净的方法是保持数据不变，并将有问题的特征值提升为正值。这通常通过两种方式完成：

SVD直接解决了问题的核心：分解协方差矩阵并用小的正特征值替换负特征值epsilon。这将使您的矩阵回到 PD 域，从而引入最小的误差。

如果 SVD 的计算成本太高，另一种数值方法就是添加epsilon * np.eye(D)到协方差矩阵中。epsilon这具有添加到每个特征值的效果。同样，如果epsilon足够小，就不会引入太大的错误。

如果您采用最后一种方法，您将需要告诉gaussian_kde修改其协方差矩阵。我发现这是一种相对干净的方法：只需将此类添加到代码库中并替换gaussian_kde为GaussianKde（在我这边测试过，似乎工作正常）。

class GaussianKde(gaussian_kde): """ Drop-in replacement for gaussian_kde that adds the class attribute EPSILON to the covmat eigenvalues, to prevent exceptions due to numerical error. """ EPSILON = 1e-10 # adjust this at will def _compute_covariance(self): """Computes the covariance matrix for each Gaussian kernel using covariance_factor(). """ self.factor = self.covariance_factor() # Cache covariance and inverse covariance of the data if not hasattr(self, '_data_inv_cov'): self._data_covariance = np.atleast_2d(np.cov(self.dataset, rowvar=1, bias=False, aweights=self.weights)) # we're going the easy way here self._data_covariance += self.EPSILON * np.eye( len(self._data_covariance)) self._data_inv_cov = np.linalg.inv(self._data_covariance) self.covariance = self._data_covariance * self.factor**2 self.inv_cov = self._data_inv_cov / self.factor**2 L = np.linalg.cholesky(self.covariance * 2 * np.pi) self._norm_factor = 2*np.log(np.diag(L)).sum() # needed for scipy 1.5.2 self.log_det = 2*np.log(np.diag(L)).sum() # changed var name on 1.6.2
Run Code Online (Sandbox Code Playgroud)

解释

如果您的错误类似，但不完全一样，或者有人感到好奇，这是我遵循的过程，希望它有所帮助。

异常堆栈指定错误是在 Cholesky 分解过程中触发的。就我而言，这是方法内的这一行_compute_covariance。

事实上，在异常发生后，通过检查covariance和属性的特征值表明具有接近于零的负特征值，因此其逆特征值很大。由于 Cholesky 期望 PD 矩阵，这会引发错误。inv_covnp.eighcovariance

此时，我们可以严重怀疑微小的负特征值是一个数值误差，因为协方差矩阵是 PSD。事实上，当协方差矩阵最初是根据已传递给构造函数的数据集（此处）计算时，就会出现错误源。就我而言，协方差矩阵产生至少 2 个几乎相同的列，这导致由于数值误差而产生剩余负特征值。

您的数据集何时会产生拟奇异协方差矩阵？这个问题在另一篇SE 帖子中得到了完美的解决。底线是：（对ttnphns 的出色工作If some variable is an exact linear combination of the other variables, with constant term allowed, the correlation and covariance matrices of the variables will be singular. The dependency observed in such matrix between its columns is actually that same dependency as the dependency between the variables in the data observed after the variables have been centered (their means brought to 0) or standardized (if we mean correlation rather than covariance matrix)表示感谢和+1 ）。

归档时间：	5 年，5 月前
查看次数：	1635 次
最近记录：	2 年，5 月前