How to compute pairwise distances on a large dataset with sklearn.metrics.pairwise_distances or scipy.spatial.distance.cdist
I am trying to compute the pairwise distances between the features of a dataset of roughly 300,000 images and the features of a subset of that dataset, in order to take their minimum. However, the resulting distance matrix is too large to fit in memory, causing an OOM error.
For context, I am adapting the kcenter_greedy method for active learning, as implemented here (https://github.com/google/active-learning/blob/master/sampling_methods/kcenter_greedy.py).
Most of the academic datasets this is evaluated on are small, so the code works fine there. In my case I want to use a much larger dataset, for which the sklearn.metrics.pairwise_distances call is no longer practical.
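To put rough numbers on the problem (the center-subset size here is a hypothetical 100,000; only the 300,000-sample figure is from my setup), a dense float64 distance matrix quickly exceeds typical RAM:

```python
n_samples = 300_000   # full dataset (as in my case)
n_centers = 100_000   # hypothetical size of the selected subset
bytes_per_float = 8   # float64

# Size of the dense (n_samples, n_centers) distance matrix in GiB.
gib = n_samples * n_centers * bytes_per_float / 2**30
print(f"{gib:.1f} GiB")  # ~223.5 GiB
```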
Here is the relevant part of the code:
def update_distances(self, cluster_centers, only_new=True, reset_dist=False):
    """Update min distances given cluster centers.

    Args:
      cluster_centers: indices of cluster centers
      only_new: only calculate distance for newly selected points and update
        min_distances.
      reset_dist: whether to reset min_distances.
    """
    if reset_dist:
        self.min_distances = None
    if only_new:
        cluster_centers = [d for d in cluster_centers
                           if d not in self.already_selected]
    if cluster_centers:
        # Update min_distances for all examples given new cluster center.
        x = self.features[cluster_centers]
        ############ This is the section causing OOM issues ############
        dist = pairwise_distances(self.features, x, metric=self.metric)

        if self.min_distances is None:
            self.min_distances = np.min(dist, axis=1).reshape(-1, 1)
        else:
            self.min_distances = np.minimum(self.min_distances, dist)
def select_batch_(self, model, already_selected, N, **kwargs):
    """Diversity-promoting active learning method that greedily forms a batch
    to minimize the maximum distance to a cluster center among all unlabeled
    datapoints.

    Args:
      model: model with scikit-like API with decision_function implemented
      already_selected: index of datapoints already selected
      N: batch size

    Returns:
      indices of points selected to minimize distance to cluster centers
    """
    try:
        # Assumes that the transform function takes in original data and not
        # flattened data.
        print('Getting transformed features...')
        self.features = model.transform(self.X)
        print('Calculating distances...')
        self.update_distances(already_selected, only_new=False, reset_dist=True)
    except Exception:
        print('Using flat_X as features.')
        self.update_distances(already_selected, reset_dist=False)
    new_batch = []
    for _ in range(N):
        if self.already_selected is None:
            # Initialize centers with a randomly selected datapoint
            ind = np.random.choice(np.arange(self.n_obs))
        else:
            ind = np.argmax(self.min_distances)
        # New examples should not be in already selected since those points
        # should have min_distance of zero to a cluster center.
        assert ind not in already_selected
        self.update_distances([ind], reset_dist=False)
        new_batch.append(ind)
    print('Maximum distance from cluster centers is %0.2f'
          % max(self.min_distances))
    self.already_selected = already_selected
    return new_batch
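One direction I have been exploring for the OOM line is scikit-learn's pairwise_distances_chunked, which yields the distance matrix in blocks and lets a reduce_func collapse each block before the next one is computed, so the full (n_samples, n_centers) matrix never exists at once. A minimal sketch (the array sizes and feature data here are made up for illustration, not from my actual dataset):

```python
import numpy as np
from sklearn.metrics import pairwise_distances, pairwise_distances_chunked

rng = np.random.default_rng(0)
features = rng.random((10_000, 64))  # stand-in for self.features
x = features[:5]                     # stand-in for self.features[cluster_centers]

def reduce_func(D_chunk, start):
    # Collapse each chunk of the distance matrix to its row-wise minimum,
    # so only a (chunk_rows, n_centers) block is ever held in memory.
    return D_chunk.min(axis=1)

chunked_min = np.concatenate(
    list(pairwise_distances_chunked(features, x,
                                    reduce_func=reduce_func,
                                    metric='euclidean',
                                    working_memory=64))  # MiB per chunk
).reshape(-1, 1)

# Sanity check against the dense computation (feasible at this toy size):
dense_min = pairwise_distances(features, x,
                               metric='euclidean').min(axis=1).reshape(-1, 1)
assert np.allclose(chunked_min, dense_min)
```

In update_distances, the `dist = pairwise_distances(...)` plus `np.min(dist, axis=1)` pair could presumably be replaced by something along these lines, with working_memory tuned to the available RAM.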
I also tried the scipy.spatial.distance.cdist function, but that did not help with the OOM problem.
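cdist on its own builds the full matrix just like pairwise_distances, so it runs out of memory the same way. The only cdist-based workaround I can see is manual chunking over the rows; a rough sketch of what I mean, with made-up array sizes and a hypothetical helper name:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
features = rng.random((8_000, 32))  # stand-in for self.features
centers = features[:3]              # stand-in for the cluster-center features

def chunked_min_dist(features, centers, chunk_size=1_000, metric='euclidean'):
    """Row-wise min distance to `centers`, computed block by block so that
    only a (chunk_size, n_centers) matrix exists at any one time."""
    mins = np.empty(len(features))
    for start in range(0, len(features), chunk_size):
        block = features[start:start + chunk_size]
        mins[start:start + chunk_size] = cdist(block, centers,
                                               metric=metric).min(axis=1)
    return mins.reshape(-1, 1)

mins = chunked_min_dist(features, centers)
# Agrees with the all-at-once version (feasible at this toy size):
full = cdist(features, centers).min(axis=1).reshape(-1, 1)
assert np.allclose(mins, full)
```

This keeps peak memory at chunk_size × n_centers floats, but I am not sure it is the cleanest approach.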
Is there a better, more memory-efficient way to find these minimum distances? Optimizations in the scikit-learn library have helped me in the past, but in this case they don't seem to apply to a dataset of this size.