kha*_*han 60 python pca scikit-learn
I have a (26424 x 144) array and I want to perform PCA on it using Python. However, there is no particular place on the web that explains how to achieve this task (some sites just do PCA their own way - there is no generalized way of doing it that I could find). Anybody with any sort of help would be great.
dou*_*oug 82
I am posting my answer even though another answer has already been accepted; the accepted answer relies on a deprecated function; in addition, that deprecated function is based on Singular Value Decomposition (SVD), which (although perfectly valid) is the more memory- and processor-intensive of the two general techniques for computing PCA. This is particularly relevant here because of the size of the data array in the OP. Using covariance-based PCA, the array used in the computation flow is just 144 x 144, rather than 26424 x 144 (the dimensions of the original data array).
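As a rough illustration of that size argument (a sketch on a random placeholder array with the OP's dimensions, not the OP's actual data):

import numpy as np

X = np.random.randn(26424, 144)           # placeholder standing in for the OP's data

# covariance route: the eigendecomposition only ever touches a 144 x 144 array
R = np.cov(X, rowvar=False)
print(R.shape)                            # (144, 144)

# SVD route: the decomposition operates on the full 26424 x 144 array
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)         # (26424, 144) (144,) (144, 144)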
Here is a simple, working implementation of PCA using the linalg module from SciPy. Because this implementation first computes the covariance matrix and then performs all subsequent computations on that array, it uses far less memory than SVD-based PCA.
(The linalg module in NumPy can also be used with no change to the code below, aside from the import statement, which would be from numpy import linalg as LA.)
The two key steps in this PCA implementation are:
calculating the covariance matrix; and
taking the eigenvectors and eigenvalues of that cov matrix
In the function below, the parameter dims_rescaled_data refers to the desired number of dimensions in the rescaled data matrix; this parameter has a default value of just two dimensions, but the code below isn't limited to two, it could be any value less than the column number of the original data array.
def PCA(data, dims_rescaled_data=2):
    """
    returns: data transformed in 2 dims/columns + regenerated original data
    pass in: data as 2D NumPy array
    """
    import numpy as NP
    from scipy import linalg as LA
    m, n = data.shape
    # mean center the data
    data -= data.mean(axis=0)
    # calculate the covariance matrix
    R = NP.cov(data, rowvar=False)
    # calculate eigenvectors & eigenvalues of the covariance matrix
    # use 'eigh' rather than 'eig' since R is symmetric,
    # the performance gain is substantial
    evals, evecs = LA.eigh(R)
    # sort eigenvalues and eigenvectors in decreasing order of eigenvalue
    idx = NP.argsort(evals)[::-1]
    evecs = evecs[:, idx]
    evals = evals[idx]
    # select the first n eigenvectors (n is the desired dimension
    # of the rescaled data array, i.e. dims_rescaled_data)
    evecs = evecs[:, :dims_rescaled_data]
    # carry out the transformation on the data using eigenvectors
    # and return the re-scaled data, eigenvalues, and eigenvectors
    return NP.dot(evecs.T, data.T).T, evals, evecs
def test_PCA(data, dims_rescaled_data=2):
    '''
    test by attempting to recover original data array from
    the eigenvectors of its covariance matrix & comparing that
    'recovered' array with the original data
    '''
    import numpy as NP
    data = NP.asarray(data, dtype=float)
    # keep every dimension so that the reconstruction is exact
    data_rescaled, evals, evecs = PCA(data.copy(), dims_rescaled_data=data.shape[1])
    # invert the projection and add back the mean that PCA() removed
    data_recovered = NP.dot(data_rescaled, evecs.T) + data.mean(axis=0)
    assert NP.allclose(data, data_recovered)
def plot_pca(data):
    from matplotlib import pyplot as MPL
    clr1 = '#2026B2'
    fig = MPL.figure()
    ax1 = fig.add_subplot(111)
    data_resc, evals, evecs = PCA(data)
    ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1)
    MPL.show()
>>> import numpy as NP
>>> # iris, probably the most widely used reference data set in ML
>>> df = "~/iris.csv"
>>> data = NP.loadtxt(df, delimiter=',')
>>> # remove class labels
>>> data = data[:,:-1]
>>> plot_pca(data)
The plot below is a visual representation of this PCA function on the iris data. As you can see, the 2D transformation cleanly separates class I from class II and class III (but not class II from class III, which in fact requires another dimension).
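As a quick sanity check (not part of the original answer, and assuming scikit-learn is installed), the projection returned by PCA() above should agree with sklearn's up to the arbitrary sign of each component:

import numpy as NP
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA as SKPCA

iris_data = load_iris().data.astype(float)
ours, evals, evecs = PCA(iris_data.copy(), dims_rescaled_data=2)
theirs = SKPCA(n_components=2).fit_transform(iris_data)
# each principal component is only defined up to sign, so compare absolute values
print(NP.allclose(NP.abs(ours), NP.abs(theirs)))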

Enr*_*eri 49
You can find a PCA function in the matplotlib module:
import numpy as np
from matplotlib.mlab import PCA
data = np.array(np.random.randint(10,size=(10,3)))
results = PCA(data)
results will store the various parameters of the PCA. It comes from the mlab part of matplotlib, which is the compatibility layer with the MATLAB syntax.
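For reference, a minimal sketch of inspecting those parameters; the attribute names below come from older matplotlib releases (mlab.PCA was later deprecated and removed, so this assumes a matplotlib version that still ships it):

# assumes a matplotlib release that still includes matplotlib.mlab.PCA
print(results.fracs)   # proportion of variance explained by each principal component
print(results.Y)       # the input data projected into PCA space
print(results.Wt)      # the weight / eigenvector matrix used for the projection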
EDIT: on the blog nextgenetics I found a wonderful demonstration of how to perform and display a PCA with the matplotlib mlab module, have fun and check out that blog!
Mar*_*ark 18
Another Python PCA using numpy. The same idea as @doug's, but that one didn't run.
from numpy import array, dot, mean, std, empty, argsort
from numpy.linalg import eigh, solve
from numpy.random import randn
from matplotlib.pyplot import subplots, show

def cov(data):
    """
    Covariance matrix
    note: specifically for mean-centered data
    note: numpy's `cov` uses N-1 as normalization
    """
    return dot(data.T, data) / data.shape[0]
    # N = data.shape[1]
    # C = empty((N, N))
    # for j in range(N):
    #     C[j, j] = mean(data[:, j] * data[:, j])
    #     for k in range(j + 1, N):
    #         C[j, k] = C[k, j] = mean(data[:, j] * data[:, k])
    # return C

def pca(data, pc_count = None):
    """
    Principal component analysis using eigenvalues
    note: this mean-centers and auto-scales the data (in-place)
    """
    data -= mean(data, 0)
    data /= std(data, 0)
    C = cov(data)
    E, V = eigh(C)
    key = argsort(E)[::-1][:pc_count]
    E, V = E[key], V[:, key]
    U = dot(data, V)  # used to be dot(V.T, data.T).T
    return U, E, V

""" test data """
data = array([randn(8) for k in range(150)])
data[:50, 2:4] += 5
data[50:, 2:5] += 5

""" visualize """
trans = pca(data, 3)[0]
fig, (ax1, ax2) = subplots(1, 2)
ax1.scatter(data[:50, 0], data[:50, 1], c = 'r')
ax1.scatter(data[50:, 0], data[50:, 1], c = 'b')
ax2.scatter(trans[:50, 0], trans[:50, 1], c = 'r')
ax2.scatter(trans[50:, 0], trans[50:, 1], c = 'b')
show()
which produces the same thing as the much shorter
from sklearn.decomposition import PCA

def pca2(data, pc_count = None):
    return PCA(n_components = pc_count).fit_transform(data)
As far as I understand, using eigenvalues (the first way) is better for high-dimensional data with fewer samples, whereas using Singular Value Decomposition is better if you have more samples than dimensions.
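Either way, the two routes span the same subspace; a tiny sketch (on random data, not from the original answer) showing that the eigenvector route and the SVD route give the same scores up to sign:

import numpy as np

X = np.random.randn(100, 5)
Xc = X - X.mean(axis=0)

# route 1: eigenvectors of the covariance matrix
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores_eig = Xc @ evecs[:, np.argsort(evals)[::-1]]

# route 2: right singular vectors of the centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = Xc @ Vt.T

# each component is only defined up to sign, so compare absolute values
print(np.allclose(np.abs(scores_eig), np.abs(scores_svd)))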
Cal*_*eng 13
This is a job for numpy.
Here's a tutorial demonstrating how principal component analysis can be done using numpy's built-in modules like mean, cov, double, cumsum, dot, linalg, array, rank:
http://glowingpython.blogspot.sg/2011/07/principal-component-analysis-with-numpy.html
Notice that scipy also has a long explanation here -
https://github.com/scikit-learn/scikit-learn/blob/babe4a5d0637ca172d47e1dfdd2f6f3c3ecb28db/scikits/learn/utils/extmath.py#L105
with scikit-learn having a library with more code examples -
https://github.com/scikit-learn/scikit-learn/blob/babe4a5d0637ca172d47e1dfdd2f6f3c3ecb28db/scikits/learn/utils/extmath.py#L105
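A condensed sketch in the spirit of that tutorial, built only from those numpy building blocks (the function and variable names here are illustrative, not taken from the tutorial):

import numpy as np

def numpy_pca(X, n_components=2):
    """PCA from plain numpy pieces: mean, cov, linalg.eigh, argsort, dot, cumsum."""
    Xc = X - np.mean(X, axis=0)                      # mean-center the data
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(evals)[::-1]                  # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    explained = np.cumsum(evals) / np.sum(evals)     # cumulative fraction of variance
    return np.dot(Xc, evecs[:, :n_components]), explained

X = np.random.randn(200, 6)
scores, explained = numpy_pca(X, n_components=2)
print(scores.shape, explained)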
Here are the scikit-learn options. With both methods, StandardScaler was used because PCA is affected by scale.
Method 1: Have scikit-learn choose the minimum number of principal components such that at least x% of the variance is retained (90% in the example below).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
iris = load_iris()
# mean-centers and auto-scales the data
standardizedData = StandardScaler().fit_transform(iris.data)
pca = PCA(.90)
principalComponents = pca.fit_transform(X = standardizedData)
# To get how many principal components were chosen
print(pca.n_components_)
Method 2: Choose the number of principal components yourself (in this case, 2 was chosen).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
iris = load_iris()
standardizedData = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X = standardizedData)
# to get how much variance was retained
print(pca.explained_variance_ratio_.sum())
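To see the whole trade-off rather than a single number, a small follow-up sketch (reusing the standardizedData array from above) prints the cumulative variance retained for every possible number of components:

import numpy as np

pca_full = PCA()   # no n_components: keep every component
pca_full.fit(standardizedData)
print(np.cumsum(pca_full.explained_variance_ratio_))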
Source: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60