Tags: python, pca, dimensionality-reduction, scikit-learn, biplot
I saw this tutorial in R w/ autoplot, where they plot both the loadings and the loading labels:
autoplot(prcomp(df), data = iris, colour = 'Species',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)
https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html
However, I prefer to do my data analysis in Python 3 w/ matplotlib, scikit-learn, and pandas, and I don't know how to add the loading vectors. How can you plot these vectors w/ matplotlib?
I've been reading "Recovering feature names of explained_variance_ratio_ in PCA with sklearn" but haven't figured it out yet.
Here's how I plot it in Python:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
%matplotlib inline
np.random.seed(0)

# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns = load_iris().feature_names)

Se_targets = pd.Series(load_iris().target,
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       name = "Species")

# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
                           index = DF_data.index,
                           columns = DF_data.columns)

# Sklearn for Principal Component Analysis
# Dims
m = DF_standard.shape[1]
K = 2

# PCA (How I tend to set it up)
Mod_PCA = decomposition.PCA(n_components=m)
DF_PCA = pd.DataFrame(Mod_PCA.fit_transform(DF_standard),
                      columns=["PC%d" % k for k in range(1, m + 1)]).iloc[:,:K]

# Color classes
color_list = [{0:"r", 1:"g", 2:"b"}[x] for x in Se_targets]
fig, ax = plt.subplots()
ax.scatter(x=DF_PCA["PC1"], y=DF_PCA["PC2"], color=color_list)
Answer (mak*_*kis):
You can do this by writing a biplot function as below. In this example I am using the iris data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target
# In general, it's a good idea to scale the data prior to PCA.
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
pca = PCA()
x_new = pca.fit_transform(X)
def myplot(score, coeff, labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c = y)
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1], color = 'r', alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0] * 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0] * 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

# Call the function. Use only the 2 PCs.
myplot(x_new[:,0:2], np.transpose(pca.components_[0:2, :]))
plt.show()
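To label the arrows with the actual feature names instead of Var1–Var4, pass them through the function's labels argument:

# Same call as above, but label arrows with the iris feature names
myplot(x_new[:,0:2], np.transpose(pca.components_[0:2, :]),
       labels=iris.feature_names)
plt.show()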
Result: [biplot image]
Try the pca library. It will plot the explained variance and create a biplot. (X here is the feature matrix, e.g., the scaled iris data from the answer above.)
pip install pca
from pca import pca
# Initialize to reduce the data up to the number of components that explain 95% of the variance.
model = pca(n_components=0.95)
# Or reduce the data towards 2 PCs
model = pca(n_components=2)
# Fit transform
results = model.fit_transform(X)
# Plot explained variance
fig, ax = model.plot()
# Scatter first 2 PCs
fig, ax = model.scatter()
# Make biplot with the number of features
fig, ax = model.biplot(n_feat=4)
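As a side note, fit_transform here returns a dictionary of results. The exact keys depend on the library version, so the names below are assumptions to verify against your installed version:

# Inspect what the model computed (key names are assumptions; check results.keys())
print(results.keys())
print(results.get('topfeat'))  # best-loading feature per PC, if this key is present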
I would like to add a general solution to this topic. After carefully studying existing solutions (in both Python and R) and a range of datasets (especially biological "omics" datasets), I came up with the following Python solution, which has these advantages:
1. It scales the scores (samples) and loadings (features) properly so that they look pleasing in a single plot. It should be pointed out that the relative scales of samples and features have no mathematical meaning (though their relative directions do); making them similarly sized simply facilitates exploration.
2. It can handle high-dimensional data with many features, visualizing only the top several features (arrows) that drive the most variation in the data. This involves explicit selection and scaling of the top features.
An example of the final output (using "Moving Pictures", a classic dataset from my research field): [biplot image]

Preparation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
We will use the iris dataset (150 samples, 4 features).
# load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
targets = iris.target_names
features = iris.feature_names
# standardization
X_scaled = StandardScaler().fit_transform(X)
# PCA
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)
# coordinates of samples (i.e., scores; let's take the first two axes)
scores = X_reduced[:, :2]
# coordinates of features (i.e., loadings; note the transpose)
loadings = pca.components_[:2].T
# proportions of variance explained by axes
pvars = pca.explained_variance_ratio_[:2] * 100
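At this point it can help to confirm what we have (a quick check; the expected values below assume the standardized iris data):

print(scores.shape)    # (150, 2) -- one row per sample
print(loadings.shape)  # (4, 2)   -- one row per feature
print(pvars)           # approx. [72.96, 22.85]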
Here comes the key part: scale the features (arrows) properly to match the samples (points). The following code scales by the maximum absolute value of the samples on each axis.
arrows = loadings * np.abs(scores).max(axis=0)
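A quick sanity check: since each loading is at most 1 in absolute value, the scaled arrows can never extend beyond the most extreme sample on either axis:

print(np.abs(scores).max(axis=0))  # per-axis extent of the points
print(np.abs(arrows).max(axis=0))  # per-axis extent of the arrows (bounded by the above)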
As discussed in seralouk's answer above, an alternative is to scale by the range (max minus min) of the scores, but it will make the arrows larger than the points.
# arrows = loadings * np.ptp(scores, axis=0)
Then we plot the points and the arrows:
plt.figure(figsize=(5, 5))

# samples as points
for i, name in enumerate(targets):
    plt.scatter(*zip(*scores[y == i]), label=name)
plt.legend(title='Species')

# empirical formula to determine arrow width
width = -0.0075 * np.min([np.subtract(*plt.xlim()), np.subtract(*plt.ylim())])

# features as arrows
for i, arrow in enumerate(arrows):
    plt.arrow(0, 0, *arrow, color='k', alpha=0.5, width=width, ec='none',
              length_includes_head=True)
    plt.text(*(arrow * 1.05), features[i],
             ha='center', va='center')

# axis labels
for i, axis in enumerate('xy'):
    getattr(plt, f'{axis}ticks')([])
    getattr(plt, f'{axis}label')(f'PC{i + 1} ({pvars[i]:.2f}%)')
Compare the result with that of the R solution; you can see they are highly consistent. (Note: it is known that the PCAs of R and scikit-learn can have opposite axes. You can flip one of them to make the directions match.)
iris.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
biplot(iris.pca, scale = 0)
library(ggfortify)
autoplot(iris.pca, data = iris, colour = 'Species',
         loadings = TRUE, loadings.colour = 'dimgrey',
         loadings.label = TRUE, loadings.label.colour = 'black')
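If you want the Python plot to match R's orientation exactly, flipping an axis just means negating both the scores and the loadings on that axis; a minimal sketch using the variables from the code above:

# Flip PC1: the sign of a principal axis is arbitrary, so this changes
# only the orientation, not the geometry of the biplot
scores[:, 0] *= -1
arrows[:, 0] *= -1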
Next comes a more complicated scenario: high-dimensional data. We will use the digits dataset (1,797 samples, 64 features).

# load data
digits = datasets.load_digits()
X = digits.data
y = digits.target
targets = digits.target_names
features = digits.feature_names

# analysis
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

# results
scores = X_reduced[:, :2]
loadings = pca.components_[:2].T
pvars = pca.explained_variance_ratio_[:2] * 100

Now we will find the top k features that best explain our data.

k = 8

Method 1: find the top k arrows that appear longest (i.e., furthest from the origin) in the visible plot:

tops = (loadings ** 2).sum(axis=1).argsort()[-k:]
arrows = loadings[tops]

Method 2: find the top k features that drive the most variance among the visible PCs:

# tops = (loadings * pvars).sum(axis=1).argsort()[-k:]
# arrows = loadings[tops]
Now there is a new problem: when the number of features is large, the top k features make up only a tiny portion of all features, so their contribution to the data variance is small and they look tiny in the plot.

To solve this, I came up with the following code. The rationale is: for all features, the sum of squared loadings is always 1 per PC; with only a small portion of the features, we should bring them up so that the sum of their squared loadings is also 1. This method is tested and working, and generates nice plots.

arrows /= np.sqrt((arrows ** 2).sum(axis=0))
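To see the rationale numerically: the full loading matrix has unit sum of squared loadings per PC, and the line above restores that property for the selected subset:

# Each PC (row) of the full loading matrix has unit sum of squares...
print((pca.components_[:2] ** 2).sum(axis=1))  # -> [1., 1.]
# ...and after rescaling, so does the k-feature subset (per column here)
print((arrows ** 2).sum(axis=0))               # -> [1., 1.]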
Then we will scale the arrows to match the samples (as discussed above):

arrows *= np.abs(scores).max(axis=0)
Now we can render the biplot:
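The plotting code is essentially the same as in the iris example above; here is a minimal sketch of it (the smaller marker size and the 'Digit' legend title are my choices; arrow labels now come from features[tops]):

plt.figure(figsize=(5, 5))

# samples as points (10 digit classes)
for i, name in enumerate(targets):
    plt.scatter(*zip(*scores[y == i]), label=name, s=8)
plt.legend(title='Digit')

# empirical formula to determine arrow width (same as above)
width = -0.0075 * np.min([np.subtract(*plt.xlim()), np.subtract(*plt.ylim())])

# top-k features as arrows, labelled with their original names
for i, arrow in zip(tops, arrows):
    plt.arrow(0, 0, *arrow, color='k', alpha=0.5, width=width, ec='none',
              length_includes_head=True)
    plt.text(*(arrow * 1.05), features[i], ha='center', va='center')

# axis labels with explained variance
for i, axis in enumerate('xy'):
    getattr(plt, f'{axis}ticks')([])
    getattr(plt, f'{axis}label')(f'PC{i + 1} ({pvars[i]:.2f}%)')
plt.show()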
I hope my answer is useful to the community.