How do I combine PCA with SOM to get proper clusters of data points in Python?
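I have about 5 different cases, and from each case I extracted about 13 or 14 statistical features. I want to build something like anomaly detection. I reduce the feature matrix with principal component analysis (PCA), and I want to use a self-organizing map (SOM) to organize the clusters so they become clearer. After that I plan to implement anomaly detection with the following methods (taken from this link: Machine learning for anomaly detection and condition monitoring):
- Mahalanobis distance metric
- Autoencoder model
My questions are:
- Is my approach correct? (See the code below.)
- How do I get the data points from the SOM, so that I can apply the Mahalanobis distance metric to new data points from the SOM?
- How do I find the right parameters to use in the SOM?
- What if the matrix from the SOM has negative values?
- Can you explain what the quantization error is? Is smaller better? I keep getting errors in this range: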
quantization error: 0.8791745577185559
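For reference, the quantization error MiniSom reports is the average Euclidean distance between each input sample and the weight vector of its best-matching unit, so it is expressed in the units of the (PCA-transformed) input and depends on its scale. A minimal NumPy sketch of that definition follows. The `weights` array and the sample data are made up for illustration and do not come from the post; `quantization_error` here is a standalone helper, not MiniSom's own method.

```python
import numpy as np

def quantization_error(data, weights):
    """Mean distance from each sample to its closest SOM weight vector.

    data:    array of shape (n_samples, n_features)
    weights: array of shape (rows, cols, n_features),
             the same layout MiniSom's get_weights() returns
    """
    # Flatten the grid into a list of neuron weight vectors
    codebook = weights.reshape(-1, weights.shape[-1])
    # Pairwise Euclidean distances, shape (n_samples, n_neurons)
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    # Distance to the best-matching unit, averaged over all samples
    return dists.min(axis=1).mean()

# Hypothetical 1x2 map and four 2-D samples
weights = np.array([[[0.0, 0.0], [10.0, 0.0]]])
data = np.array([[0.0, 1.0], [0.0, -1.0], [10.0, 2.0], [10.0, -2.0]])
qe = quantization_error(data, weights)
print(qe)
```

A smaller value means the neurons sit closer to the data, but the value also tends to shrink as the map gets more neurons, so it is mainly useful for comparing settings on the same data rather than as an absolute score.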
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from minisom import MiniSom

def cov_matrix(data, verbose=False):
    """Return the covariance matrix of `data` (rows = samples) and its inverse."""
    covariance_matrix = np.cov(data, rowvar=False)
    if not is_pos_def(covariance_matrix):
        # Raise instead of printing, so the caller never unpacks None
        raise ValueError("Covariance Matrix is not positive definite!")
    inv_covariance_matrix = np.linalg.inv(covariance_matrix)
    if not is_pos_def(inv_covariance_matrix):
        raise ValueError("Inverse of Covariance Matrix is not positive definite!")
    return covariance_matrix, inv_covariance_matrix

def Mahalanobisdist(inv_cov_matrix, mean_distr, data, verbose=False):
    """Mahalanobis distance of every row in `data` from `mean_distr`."""
    diff = data - mean_distr
    md = []
    for i in range(len(diff)):
        md.append(np.sqrt(diff[i].dot(inv_cov_matrix).dot(diff[i])))
    return md

def MD_detectOutliers(dist, extreme=False, verbose=False):
    """Return the indices of distances at or above k times the mean distance."""
    k = 3. if extreme else 2.
    threshold = np.mean(dist) * k
    outliers = []
    for i in range(len(dist)):
        if dist[i] >= threshold:
            outliers.append(i)  # index of the outlier
    return np.array(outliers)

def MD_threshold(dist, extreme=False, verbose=False):
    """Return the anomaly threshold: k times the mean distance."""
    # `extreme` was used here but missing from the signature
    k = 3. if extreme else 2.
    threshold = np.mean(dist) * k
    return threshold

def is_pos_def(A):
    """True if A is symmetric and its Cholesky factorization succeeds."""
    if np.allclose(A, A.T):
        try:
            np.linalg.cholesky(A)
            return True
        except np.linalg.LinAlgError:
            return False
    else:
        return False
## Get the Statistical features
## Form matrix
## Obtain the principal components
## Do SOM to the principal components (I am using miniSOM)
# Initialization of SOM and training:
som_shape = (1, 5)
full_PCA_dataframe_np = full_pca_dataframe.to_numpy()
som = MiniSom(som_shape[0], som_shape[1], full_PCA_dataframe_np.shape[1], sigma=.4, learning_rate=.15, neighborhood_function='gaussian')
som.train_batch(full_PCA_dataframe_np, 8000, verbose=True)
# each neuron represents a cluster
winner_coordinates = np.array([som.winner(x) for x in full_PCA_dataframe_np]).T
# with np.ravel_multi_index we convert the bidimensional coordinates to a monodimensional index
cluster_index = np.ravel_multi_index(winner_coordinates, som_shape)
# plotting the clusters using the first 2 dimensions of the data
for c in np.unique(cluster_index):
    plt.scatter(full_PCA_dataframe_np[cluster_index == c, 0], full_PCA_dataframe_np[cluster_index == c, 1], label='cluster=' + str(c), alpha=.5)
# plotting centroids (one row of the weight grid per iteration)
for centroid in som.get_weights():
    plt.scatter(centroid[:, 0], centroid[:, 1], marker='x', s=25, linewidths=5, color='k', label='centroid')
plt.legend()
plt.show()
## Get the datapoints and Implement the Mahalanobis distance metric on each case:
data_train = np.array(X_train_PCA.values)  # Say Case 1
data_test = np.array(X_test_PCA.values)    # Say Case 3
# Obtain the covariance matrix and implement Mahalanobis distance:
# (the result is not named cov_matrix, so the function is not overwritten)
covariance, inv_cov_matrix = cov_matrix(data_train)
mean_distr = data_train.mean(axis=0)
dist_test = Mahalanobisdist(inv_cov_matrix, mean_distr, data_test, verbose=False)
dist_train = Mahalanobisdist(inv_cov_matrix, mean_distr, data_train, verbose=False)
threshold = MD_threshold(dist_train, extreme=True)
# Form matrix with anomaly column:
anomaly_train = pd.DataFrame()
anomaly_train['Mob dist'] = dist_train
anomaly_train['Thresh'] = threshold
# If Mob dist above threshold: Flag as anomaly
anomaly_train['Anomaly'] = anomaly_train['Mob dist'] > anomaly_train['Thresh']
anomaly_train.index = X_train_PCA.index
anomaly = pd.DataFrame()
anomaly['Mob dist'] = dist_test
anomaly['Thresh'] = threshold
# If Mob dist above threshold: Flag as anomaly
anomaly['Anomaly'] = anomaly['Mob dist'] > anomaly['Thresh']
anomaly.index = X_test_PCA.index
anomaly.head()
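One way to read the question about getting data points from the SOM: the SOM does not produce new points, it assigns each existing row to a winning neuron. Under that assumption, a possible sketch is to group the rows by cluster index, fit a mean and covariance per cluster, and score new points against a chosen cluster. The data, the `labels` array, and the helper names `fit_cluster_stats` and `mahalanobis` are hypothetical; in the code above the labels would come from `cluster_index`. `np.linalg.pinv` is used instead of `np.linalg.inv` as a deliberate substitution, so that a near-singular covariance does not raise.

```python
import numpy as np

def fit_cluster_stats(data, labels):
    """Return {label: (mean, inverse covariance)} for each cluster."""
    stats = {}
    for c in np.unique(labels):
        pts = data[labels == c]
        mean = pts.mean(axis=0)
        cov = np.cov(pts, rowvar=False)
        # pinv (assumption) tolerates a singular covariance matrix
        stats[int(c)] = (mean, np.linalg.pinv(cov))
    return stats

def mahalanobis(points, mean, inv_cov):
    """Row-wise Mahalanobis distance of `points` from `mean`."""
    diff = points - mean
    # Quadratic form diff_i^T * inv_cov * diff_i for every row i
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))

# Synthetic stand-in for PCA scores with two well-separated clusters
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
cluster_b = rng.normal(loc=[8.0, 8.0], scale=1.0, size=(200, 2))
data = np.vstack([cluster_a, cluster_b])
labels = np.array([0] * 200 + [1] * 200)  # stand-in for cluster_index

stats = fit_cluster_stats(data, labels)
mean_a, inv_cov_a = stats[0]

# Threshold from the training distances: mean times k, with k = 3
train_dist = mahalanobis(cluster_a, mean_a, inv_cov_a)
threshold = 3.0 * train_dist.mean()

# Score new points against cluster 0
new_points = np.array([[0.1, -0.2], [8.0, 8.0]])
new_dist = mahalanobis(new_points, mean_a, inv_cov_a)
is_anomaly = new_dist > threshold
print(is_anomaly)
```

A covariance estimate needs noticeably more rows per cluster than there are columns, so with only a handful of rows per cluster the resulting distances are likely to be noisy.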
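Another question about the SOM: my input to the SOM, after PCA, has 50 rows and 2 columns, and I have 5 clusters. What do I need to pass in when it comes to the SOM?
This is my code using miniSOM: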
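# Initialization of SOM and training:
som_shape = (7, 7)
full_PCA_dataframe_np = full_pca_dataframe.to_numpy()
# MiniSom(x, y, input_len, ...): input_len is the number of columns (2 here)
som = MiniSom(som_shape[0], som_shape[1], full_PCA_dataframe_np.shape[1], sigma=.5, learning_rate=.5, neighborhood_function='gaussian')
# train_batch(data, num_iteration, ...) also needs an iteration count; none is given here
som.train_batch(full_PCA_dataframe_np, verbose=True)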
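On the input question: MiniSom takes a 2-D array of shape (n_samples, n_features), and its third constructor argument is the number of features, so a 50 x 2 PCA matrix means an input length of 2. For the map size, one commonly quoted rule of thumb (a heuristic, not a guarantee) is roughly 5 * sqrt(N) neurons for N samples. The sketch below only does that arithmetic; `suggest_grid` is a hypothetical helper and not part of MiniSom.

```python
import math

def suggest_grid(n_samples):
    """Square grid with roughly 5 * sqrt(n_samples) neurons (heuristic only)."""
    n_neurons = 5 * math.sqrt(n_samples)
    # Round the side length up so the grid has at least n_neurons cells
    side = max(1, math.ceil(math.sqrt(n_neurons)))
    return side, side

rows, cols = suggest_grid(50)
print(rows, cols)
```

That heuristic targets maps used for visualization. If each neuron is meant to stand for one cluster, as in the first listing, a 7 x 7 map allows up to 49 clusters for 50 rows, while a 1 x 5 map matches the five expected clusters directly.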