2D PCA line fitting with numpy

lhk*_*lhk 4 python math regression numpy pca

I'm trying to implement a 2D PCA with numpy. The code is rather simple:

import numpy as np

n=10
d=10
x=np.linspace(0,10,n)
y=x*d

covmat = np.cov([x,y])
print(covmat)

eig_values, eig_vecs = np.linalg.eig(covmat)
largest_index = np.argmax(eig_values)
largest_eig_vec = eig_vecs[largest_index]

The covariance matrix is:

[[   11.31687243   113.16872428]
 [  113.16872428  1131.6872428 ]]

Then I've got a simple helper function to plot a line (as a series of points) around a given center, in a given direction. It is meant to be used with pyplot, so I'm preparing separate lists for the x and y coordinates.

def plot_line(center, dir, num_steps, step_size):
    line_x = []
    line_y = []
    for i in range(num_steps):
        dist_from_center = step_size * (i - num_steps / 2)
        point_on_line = center + dist_from_center * dir
        line_x.append(point_on_line[0])
        line_y.append(point_on_line[1])
    return (line_x, line_y)

And finally the plot setup:

lines = []
mean_point=np.array([np.mean(x),np.mean(y)])
lines.append(plot_line(mean_point, largest_eig_vec, 200, 0.5))

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)

ax.scatter(x, y, c="b", marker=".", s=10)
for line in lines:
    ax.plot(line[0], line[1], c="r")

ax.scatter(mean_point[0], mean_point[1], c="y", marker="o", s=20)

# Use the existing axes; plt.axes() may create a new, empty axes in recent matplotlib
ax.set_aspect('equal', 'datalim')
plt.show()

Unfortunately, the PCA doesn't seem to work. Here's the plot:

(image: PCA line fit)

I'm afraid I've got no idea what went wrong.

  • I've computed the covariance manually -> same result.
  • I've checked the other eigenvector -> perpendicular to the red line.
  • I've tested plot_line with the direction (1,10). It's perfectly aligned with my points: (image: perfectly aligned)

The final plot shows that the line fitted by PCA is the correct result, except that it is mirrored about the y-axis.

In fact, if I flip the sign of the x coordinate of the eigenvector, the line is fitted perfectly:

(image: perfect fit)

Apparently this is a fundamental problem. Somehow I've misunderstood how to use PCA.

Where is my mistake? Online resources seem to describe PCA exactly as I implemented it. I don't believe I have to categorically mirror my line fits about the y-axis. It's got to be something else.

Mar*_*son 5

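Your mistake is that you're extracting the last row of the eigenvector array. But the eigenvectors form the columns of the eigenvector array returned by np.linalg.eig, not the rows. From the documentation: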

[...] the arrays a, w, and v satisfy the equations dot(a[:,:], v[:,i]) = w[i] * v[:,i] [for each i]

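where a is the array that np.linalg.eig was applied to, w is the 1d array of eigenvalues, and v is the 2d array of eigenvectors. So the columns v[:, i] are the eigenvectors.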
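The quoted relation can be checked numerically. The following is a minimal sketch that reuses the data from the question; the name column_checks is introduced here only for illustration.

```python
import numpy as np

# Same data as in the question: points on the line y = 10 * x
x = np.linspace(0, 10, 10)
y = 10 * x
covmat = np.cov([x, y])

w, v = np.linalg.eig(covmat)

# Each column v[:, i] should satisfy covmat @ v[:, i] == w[i] * v[:, i]
column_checks = [
    bool(np.allclose(covmat @ v[:, i], w[i] * v[:, i]))
    for i in range(len(w))
]
print(column_checks)
```

The rows v[i] are not guaranteed to satisfy the same equation, which is why indexing by row can silently return a vector that is not an eigenvector.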

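In this simple two-dimensional case, since the two eigenvectors are mutually orthogonal (because we started with a symmetric matrix) and of unit length (because np.linalg.eig normalises them that way), the eigenvector array has one of the two forms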

[[ cos(t)  sin(t)]
 [-sin(t)  cos(t)]]

or

[[ cos(t)  sin(t)]
 [ sin(t) -cos(t)]]

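for some real number t. In the first case, reading the first row (for example) instead of the first column gives [cos(t), sin(t)] in place of [cos(t), -sin(t)]. This explains the apparent reflection that you're seeing.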
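To make the reflection concrete, here is a small sketch that builds the first form by hand. The angle t = 0.3 is an arbitrary value chosen for illustration and is not taken from the question.

```python
import numpy as np

t = 0.3  # arbitrary angle, an assumption made only for this illustration
v = np.array([[np.cos(t), np.sin(t)],
              [-np.sin(t), np.cos(t)]])

first_column = v[:, 0]  # [cos(t), -sin(t)]: what column indexing returns
first_row = v[0]        # [cos(t),  sin(t)]: what row indexing returns

# Negating the y component of the row recovers the column
flip_y = first_row * np.array([1, -1])
# Negating the x component gives minus the column, i.e. the same line direction
flip_x = first_row * np.array([-1, 1])

print(np.allclose(flip_y, first_column))
print(np.allclose(flip_x, -first_column))
```

Because a direction d and its negative -d describe the same line through a given center, flipping the sign of either component of the row is enough to land on the correct line, which matches the mirrored plot in the question.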

Replace the line

largest_eig_vec = eig_vecs[largest_index]

with

largest_eig_vec = eig_vecs[:, largest_index]

and you should get the expected results.
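As a quick sanity check of the fix, the sketch below recomputes the principal axis with column indexing and compares it with the known direction (1, 10) of the data. The names expected, is_parallel and eigh_axis are introduced here for illustration. The np.linalg.eigh variant at the end is an alternative not mentioned above: it is intended for symmetric matrices and returns eigenvalues in ascending order, so the last column belongs to the largest eigenvalue.

```python
import numpy as np

# Same data as in the question: points on the line y = 10 * x
x = np.linspace(0, 10, 10)
y = 10 * x
covmat = np.cov([x, y])

eig_values, eig_vecs = np.linalg.eig(covmat)
largest_index = np.argmax(eig_values)
largest_eig_vec = eig_vecs[:, largest_index]  # column, not row

# The data lies on y = 10 * x, so the axis should be parallel to (1, 10)
expected = np.array([1.0, 10.0]) / np.sqrt(101.0)

# Compare up to sign, since v and -v describe the same line
is_parallel = bool(np.isclose(abs(largest_eig_vec @ expected), 1.0))
print(is_parallel)

# Alternative (an assumption of this sketch, not part of the original fix):
# eigh sorts eigenvalues in ascending order, so no argmax is needed
_, eigh_vecs = np.linalg.eigh(covmat)
eigh_axis = eigh_vecs[:, -1]
print(bool(np.isclose(abs(eigh_axis @ expected), 1.0)))
```

Passing largest_eig_vec to plot_line in place of the row-indexed vector should then draw the red line through the data points.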