How do I calculate the optimal number of features for PCA in Python?

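I am running PCA as a preprocessing step on a dataset with 78 variables. How do I calculate the optimal number of PCA components?

My first idea was to start at 5, increase the number step by step, and compute the accuracy each time. For obvious reasons, though, that is not an efficient way to find it.

Does anyone have suggestions or experience? Or is there even a method for calculating the optimal value?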

Solution

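First look at how the dataset is distributed, then use explained_variance_ to find the number of components.

Start by projecting the samples onto a two-dimensional plot.

Suppose I have a face dataset of 40 people (Olivetti faces), with 10 samples per person, for 400 images in total. We will split them into 280 training and 120 test samples.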

from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split


olivetti = fetch_olivetti_faces()

x = olivetti.images  # Images, shape (400, 64, 64)
y = olivetti.target  # Labels (person id, 0 to 39)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Flatten each 64x64 image into a vector of 4096 pixel values
n_pixels = x.shape[1] * x.shape[2]
x_train = x_train.reshape((x_train.shape[0], n_pixels))
x_test = x_test.reshape((x_test.shape[0], n_pixels))
x = x.reshape((x.shape[0], n_pixels))

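Now we want to see how the images are distributed. To make this easier to understand, we will project them onto a two-dimensional plot.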

from sklearn.decomposition import PCA
from matplotlib.pyplot import figure,get_cmap,colorbar,show

class_num = 40
sample_num = 10

# Project all 400 images onto the first two principal components
projected = PCA(n_components=2).fit_transform(x)
idx_range = class_num * sample_num
fig = figure(figsize=(6, 3), dpi=300)
ax = fig.add_subplot(1, 1, 1)
c_map = get_cmap(name='jet', lut=class_num)
scatter = ax.scatter(projected[:idx_range, 0], projected[:idx_range, 1], c=y[:idx_range], s=10, cmap=c_map)

ax.set_xlabel("First Principal Component")
ax.set_ylabel("Second Principal Component")
ax.set_title("PCA projection of {} people".format(class_num))
colorbar(mappable=scatter)
show()

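With only two principal components, the 40 people (10 samples each) cannot be told apart.
Keep in mind that this plot was built from the full dataset, not from the training or the test split.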

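How many principal components do we need to clearly distinguish the data?
- To answer that question, we will use explained_variance_.
- From the documentation:
  
  The amount of variance explained by each of the selected components. Equal to the n_components largest eigenvalues of the covariance matrix of X.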
- ```
from matplotlib.pyplot import plot,xlabel,ylabel

pca2 = PCA().fit(x)
plot(pca2.explained_variance_,linewidth=2)
xlabel('Components')
ylabel('Explained Variances')
show()
```
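- From the plot above, the explained variance appears to level off after roughly 100 components, which suggests that about 100 components are enough for PCA to distinguish the people.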
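
Reading the cut-off from a plot by eye can be imprecise. One common alternative is to accumulate the explained variance ratio and take the smallest number of components that reaches a chosen share of the total variance. The sketch below is only an illustration under stated assumptions: the 95% threshold is an arbitrary choice, `components_for_variance` is a hypothetical helper name, and the bundled `load_digits` data stands in for the Olivetti faces so the example runs without a download.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA


def components_for_variance(data, threshold=0.95):
    """Hypothetical helper: smallest number of components whose
    cumulative explained variance ratio reaches `threshold`."""
    ratios = PCA(svd_solver="full").fit(data).explained_variance_ratio_
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value >= threshold, plus one because
    # component counts start at 1.
    return int(np.searchsorted(cumulative, threshold) + 1)


# Stand-in dataset (assumption): 1797 samples with 64 features each.
digits, _ = load_digits(return_X_y=True)
n_needed = components_for_variance(digits, threshold=0.95)
print(n_needed)

# scikit-learn can make the same selection itself when n_components is a
# float between 0 and 1; this needs the full SVD solver.
auto_pca = PCA(n_components=0.95, svd_solver="full").fit(digits)
print(auto_pca.n_components_)
```

The same helper should work on the flattened Olivetti array `x` from the code above. The threshold is a judgment call, so it is worth checking a few values against the plot.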

Simplified code:

from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

x,_ = fetch_olivetti_faces(return_X_y=True)
pca2 = PCA().fit(x)
plt.plot(pca2.explained_variance_,linewidth=2)
plt.xlabel('Components')
plt.ylabel('Explained Variances')
plt.show()
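
The idea from the question, trying several component counts and comparing accuracy, can be automated with cross-validation instead of a manual loop. The following is a minimal sketch and not part of the original answer: the classifier (`LogisticRegression`), the scaling step, the candidate grid, and the bundled `load_digits` data are all assumptions chosen to keep the example small and runnable offline.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): 64 features, 10 classes.
features, labels = load_digits(return_X_y=True)

# PCA sits inside the pipeline so it is refit on each training fold,
# which keeps the held-out fold out of the projection.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate component counts (illustrative values only).
param_grid = {"pca__n_components": [5, 10, 20, 30, 40]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(features, labels)

best_n = search.best_params_["pca__n_components"]
print(best_n, round(search.best_score_, 3))
```

This selects the count that works best for one particular downstream model, so the result can differ from what the explained variance plot suggests. A coarse grid keeps the cost manageable on a wider dataset.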
