Inverting the PCA transform in sklearn (with whiten=True)

mik*_*ail 10 pca python-2.7 scikit-learn

Ordinarily, the PCA transform is easy to invert:

import numpy as np
from sklearn import decomposition

x = np.zeros((500, 10))
x[:, :5] = np.random.rand(500, 5)
x[:, 5:] = x[:, :5] # so that using PCA would make sense

p = decomposition.PCA()
p.fit(x)

a = x[5, :]

print p.inverse_transform(p.transform(a)) - a  # this yields small numbers (about 10**-16)

Now, if we add the whiten=True parameter, the result is completely different:

p = decomposition.PCA(whiten=True)
p.fit(x)

a = x[5, :]

print p.inverse_transform(p.transform(a)) - a  # now yields numbers about 10**15

Since I haven't found any other way to do this, I am wondering: how can I recover the original values? Is it even possible? Any help would be much appreciated.

eic*_*erg 14

This behavior is admittedly potentially weird, but it is nevertheless documented in the docstrings of the relevant functions.

The class docstring of PCA says the following about whiten:

whiten : bool, optional
    When True (False by default) the `components_` vectors are divided
    by n_samples times singular values to ensure uncorrelated outputs
    with unit component-wise variances.

    Whitening will remove some information from the transformed signal
    (the relative variance scales of the components) but can sometime
    improve the predictive accuracy of the downstream estimators by
    making there data respect some hard-wired assumptions.

The code and docstring of PCA.inverse_transform say:

def inverse_transform(self, X):
    """Transform data back to its original space, i.e.,
    return an input X_original whose transform would be X

    Parameters
    ----------
    X : array-like, shape (n_samples, n_components)
        New data, where n_samples is the number of samples
        and n_components is the number of components.

    Returns
    -------
    X_original array-like, shape (n_samples, n_features)

    Notes
    -----
    If whitening is enabled, inverse_transform does not compute the
    exact inverse operation as transform.
    """
    return np.dot(X, self.components_) + self.mean_

Now take a look at what happens in PCA._fit when whiten=True:

    if self.whiten:
        self.components_ = V / S[:, np.newaxis] * np.sqrt(n_samples)
    else:
        self.components_ = V

Here S are the singular values and V the singular vectors. By definition, whitening levels the spectrum, essentially setting all eigenvalues of the covariance matrix to 1.
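To see this spectrum-flattening concretely, here is a minimal check (my own sketch, not from the original answer; the random data and tolerance are assumptions): the empirical covariance of the whitened scores is approximately the identity matrix.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 4) * np.array([10., 3., 1., 0.1])  # wildly uneven variances

Z = PCA(whiten=True).fit_transform(X)
cov = np.cov(Z, rowvar=False)

# All eigenvalues of the covariance of the whitened scores are ~1, i.e. the
# covariance is ~identity (up to an n/(n-1) normalization factor that has
# varied across sklearn versions, hence the loose tolerance).
assert np.allclose(cov, np.eye(4), atol=0.05)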

To finally answer your question: the PCA object of sklearn.decomposition does not allow reconstructing the original data from the whitened matrix, because the singular values of the centered data / the eigenvalues of the covariance matrix are garbage collected after the function PCA._fit.

However, if you obtain the singular values S manually, you can multiply them back in and recover the original data.

Try this:

import numpy as np
rng = np.random.RandomState(42)

n_samples_train, n_features = 40, 10
n_samples_test = 20
X_train = rng.randn(n_samples_train, n_features)
X_test = rng.randn(n_samples_test, n_features)

from sklearn.decomposition import PCA
pca = PCA(whiten=True)

pca.fit(X_train)

X_train_mean = X_train.mean(0)
X_train_centered = X_train - X_train_mean
U, S, VT = np.linalg.svd(X_train_centered, full_matrices=False)
components = VT / S[:, np.newaxis] * np.sqrt(n_samples_train)

from numpy.testing import assert_array_almost_equal
# These assertions will raise an error if the arrays aren't equal
assert_array_almost_equal(components, pca.components_)  # we have successfully 
                                                        # calculated whitened components

transformed = pca.transform(X_test)
inverse_transformed = transformed.dot(S[:, np.newaxis] ** 2 * pca.components_ /
                                            n_samples_train) + X_train_mean

assert_array_almost_equal(inverse_transformed, X_test)  # We have equality

As you can see from the line creating inverse_transformed, multiplying the singular values back onto the components takes you back to the original space.

As a matter of fact, the singular values S are hidden in the norms of the components, so there is no need to compute an SVD alongside PCA. Using the definitions above, one sees:

S_recalculated = 1. / np.sqrt((pca.components_ ** 2).sum(axis=1) / n_samples_train)
assert_array_almost_equal(S, S_recalculated)

Conclusion: by obtaining the singular values of the centered data matrix, we can undo the whitening and transform back to the original space. However, this functionality is not natively implemented in the PCA object.

Remedy: without modifying scikit-learn's code (which could be done officially if deemed useful by the community), the solution you are looking for is this (I will now use your code and variable names, please check whether this works for you):

transformed_a = p.transform(a)
singular_values = 1. / np.sqrt((p.components_ ** 2).sum(axis=1) / len(x))
inverse_transformed = np.dot(transformed_a, singular_values[:, np.newaxis] ** 2 *
                             p.components_ / len(x)) + p.mean_

(IMHO, the inverse_transform function of any estimator should get as close as possible back to the original data. In this case it would also not cost much to explicitly store the singular values, so maybe this functionality really should be added to sklearn.)

EDIT: The singular values of the centered matrix are not garbage collected, as initially thought. As a matter of fact, they are stored in pca.explained_variance_ and can be used to solve the problem. See the comments.
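Based on that edit, here is a hedged sketch of an inverse that reads the singular values off pca.explained_variance_ instead of recomputing the SVD. The helper name is mine, and it assumes the older sklearn behavior described in this answer, where components_ stores the whitened vectors and explained_variance_ equals S ** 2 / n_samples (newer releases normalize by n_samples - 1, so the factor would need adjusting):

import numpy as np

def inverse_whitened_transform(p, transformed, n_samples):
    # Recover the singular values from the stored explained variances
    # (assumes explained_variance_ == S ** 2 / n_samples).
    S = np.sqrt(p.explained_variance_ * n_samples)
    # Same inversion formula as above: scale the whitened components back
    # by S ** 2 / n_samples, then undo the centering.
    return np.dot(transformed, S[:, np.newaxis] ** 2 * p.components_ /
                  n_samples) + p.mean_

With the question's variables, inverse_whitened_transform(p, p.transform(a), len(x)) should reproduce a.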

  • Re-reading your comment: to be precise, whitening is not merely transforming the components to have unit variance (that would be easy to accomplish with `sklearn.preprocessing.StandardScaler`, and just as easy to invert). It is decorrelation in the Gaussian sense: the covariance matrix of the whitened features will be diagonal. (2 upvotes)
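The distinction in that comment is easy to demonstrate. A short sketch (my own construction, not from the thread): on two strongly correlated features, StandardScaler produces unit variances but leaves large off-diagonal covariances, while PCA whitening yields an approximately identity covariance matrix.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
x1 = rng.randn(1000)
X = np.c_[x1, x1 + 0.3 * rng.randn(1000)]  # two strongly correlated features

cov_scaled = np.cov(StandardScaler().fit_transform(X), rowvar=False)
cov_white = np.cov(PCA(whiten=True).fit_transform(X), rowvar=False)

print(np.round(cov_scaled, 2))  # unit diagonal, but large off-diagonal terms
print(np.round(cov_white, 2))   # approximately the identity matrix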