Why does sklearn PCA need more samples than new features (n_components)?

Question

When using the sklearn PCA algorithm like this:

import numpy as np
from sklearn.decomposition import PCA

x_orig = np.random.choice([0, 1], (4, 25), replace=True)
pca = PCA(n_components=15)
pca.fit_transform(x_orig).shape

I get the output:

(4, 4)

I expected (and want) it to be:

(4,15)

I get why it's happening. The sklearn documentation says (assuming their '==' is an assignment operator):

n_components == min(n_samples, n_features)

But why do they do this? Also, how can I convert an input of shape [1,25] directly to [1,10] (without stacking dummy arrays)?

Answer 1

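Each principal component is the projection of the data onto an eigenvector of the data's covariance matrix. If the number of samples n is smaller than the number of features, the covariance matrix has at most n non-zero eigenvalues (n - 1 once the data is centered, which PCA does). So there are only that many meaningful eigenvectors/components.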
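The eigenvalue count can be checked numerically. The following is a minimal sketch using only NumPy, not sklearn's internal implementation; the random seed, the variable names, and the tolerance used to decide what counts as "zero" are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 4, 25
# Random 0/1 data with the same shape as in the question.
X = rng.integers(0, 2, size=(n_samples, n_features)).astype(float)

# Center each feature, as PCA does before decomposing the data.
X_centered = X - X.mean(axis=0)

# 25 x 25 covariance matrix estimated from only 4 samples.
cov = X_centered.T @ X_centered / (n_samples - 1)
eigenvalues = np.linalg.eigvalsh(cov)

# Count eigenvalues that are not numerically zero (tolerance is an assumption).
tol = 1e-10 * eigenvalues.max()
n_nonzero = int(np.sum(eigenvalues > tol))
print(cov.shape, n_nonzero)
```

Although the covariance matrix is 25 x 25, the number of non-zero eigenvalues it reports cannot exceed n_samples - 1, because centering removes one degree of freedom from the four samples.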

In principle you could have more components than samples, but the extra components would be useless noise.

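Scikit-learn raises an error rather than silently doing something else. This stops users from shooting themselves in the foot. Having fewer samples than features may indicate a problem with the data, or a misunderstanding of the methods involved.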
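The behavior can be probed with a short script. This is a sketch under assumptions: whether the first call raises a ValueError or silently truncates the output (as in the question) depends on the installed scikit-learn version, so the code handles both cases. The seed and variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x_orig = rng.integers(0, 2, size=(4, 25))

# Asking for more components than min(n_samples, n_features).
try:
    shape = PCA(n_components=15).fit_transform(x_orig).shape
    print("no error, output shape:", shape)
except ValueError as exc:
    print("ValueError:", exc)

# Asking for at most min(n_samples, n_features) components is accepted.
reduced = PCA(n_components=4).fit_transform(x_orig)
print(reduced.shape)
```

Either way, no more than min(n_samples, n_features) = 4 components come back for a 4 x 25 input.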