How can I build a distance matrix with a custom metric without using loops?
I have an np.array like this:
[[1.3, 2.7, 0.5, NaN, NaN], [2.0, 8.9, 2.5, 5.6, 3.5], [0.6, 3.4, 9.5, 7.4, NaN]]
And a function that computes the distance between two rows:
def nan_manhattan(X, Y):
    nan_diff = np.absolute(X - Y)
    length = nan_diff.size
    return np.nansum(nan_diff) * length / (length - np.isnan(nan_diff).sum())
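I need all pairwise distances, and I don't want to use loops. How can I do that?
Solution
Use pdist, which accepts a callable as its metric, and squareform to expand the condensed result into a square matrix: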
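import numpy as np
from scipy.spatial.distance import pdist, squareform

def nan_manhattan(X, Y):
    # Manhattan distance over the positions where both rows have a value,
    # rescaled to the full row length.
    nan_diff = np.absolute(X - Y)
    length = nan_diff.size
    return np.nansum(nan_diff) * length / (length - np.isnan(nan_diff).sum())

arr = np.array([[1.3, 2.7, 0.5, np.nan, np.nan],
                [2.0, 8.9, 2.5, 5.6, 3.5],
                [0.6, 3.4, 9.5, 7.4, np.nan]])

# pdist returns the condensed (upper-triangle) distances; squareform makes them square.
result = squareform(pdist(arr, nan_manhattan))
print(result)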
Output:
[[ 0.         14.83333333 17.33333333]
 [14.83333333  0.         19.625     ]
 [17.33333333 19.625       0.        ]]
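As a sanity check on the first off-diagonal entry, the metric can be worked through by hand for the first two rows. This is only an illustrative sketch; the variable names are not from the original post.

```python
import numpy as np

x = np.array([1.3, 2.7, 0.5, np.nan, np.nan])
y = np.array([2.0, 8.9, 2.5, 5.6, 3.5])

diff = np.abs(x - y)            # NaN wherever either row is missing a value
valid = ~np.isnan(diff)         # True for the 3 positions present in both rows
partial = diff[valid].sum()     # 0.7 + 6.2 + 2.0 = 8.9

# Rescale the partial sum from 3 usable columns up to the full 5 columns.
scaled = partial * diff.size / valid.sum()
print(scaled)                   # approximately 14.8333
```

The rescaling factor (row length divided by the number of jointly valid positions) is what keeps rows with many missing values from looking artificially close.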
Using broadcasting:
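def manhattan_nan(a):
    # Sum of |a[i] - a[j]| over columns, treating NaN differences as 0
    s = np.nansum(np.abs(a[:, None, :] - a), axis=-1)
    m = ~np.isnan(a)
    k = m.sum(1)  # number of valid entries per row
    # min(k[i], k[j]) equals the jointly valid count only when the NaN
    # patterns are nested (e.g. trailing NaNs, as in the sample data)
    r = a.shape[1] / np.minimum.outer(k, k)
    out = s * r
    return out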
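One caveat: the `np.minimum.outer(k, k)` step matches `nan_manhattan` only when the valid positions of one row are a subset of the other's, which holds for the sample data because its NaNs are all trailing. For NaNs scattered at arbitrary positions, a possible variant counts the jointly valid columns with a matrix product of the validity mask. The sketch below is an assumption on my part, not part of the original answer, and `manhattan_nan_general` is a hypothetical name.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def nan_manhattan(X, Y):
    nan_diff = np.absolute(X - Y)
    length = nan_diff.size
    return np.nansum(nan_diff) * length / (length - np.isnan(nan_diff).sum())

def manhattan_nan_general(a):
    # Hypothetical variant: same broadcasting idea, different valid-count step.
    s = np.nansum(np.abs(a[:, None, :] - a), axis=-1)
    m = (~np.isnan(a)).astype(int)
    both = m @ m.T  # both[i, j] = number of columns valid in rows i and j
    # Pairs with no shared valid column divide by zero and come out as NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        return s * a.shape[1] / both

# Test data with NaNs in non-nested positions (made up for this check).
b = np.array([[1.0, np.nan, 3.0, 4.0],
              [np.nan, 2.0, 5.0, 1.0],
              [2.0, 2.0, np.nan, 0.0]])

expected = squareform(pdist(b, nan_manhattan))
print(np.allclose(manhattan_nan_general(b), expected))
```

The mask product costs one extra n-by-n matrix multiplication, which is small next to the n-by-n-by-d difference array the broadcasting step already builds.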
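Benchmarking
Judging from the OP's comments, the use case appears to be a tall array (many rows, few columns). Let's build one from the given sample data for benchmarking: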
In [2]: a
Out[2]:
array([[1.3, 2.7, 0.5, nan, nan],
       [2. , 8.9, 2.5, 5.6, 3.5],
       [0.6, 3.4, 9.5, 7.4, nan]])
In [3]: a = np.repeat(a, 100, axis=0)

# @Dani Mesejo's soln
In [4]: %timeit pdist(a, nan_manhattan)
1.02 s ± 35.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Naive for-loop version
In [18]: n = a.shape[0]
In [19]: %timeit [[nan_manhattan(a[i], a[j]) for i in range(j+1, n)] for j in range(n)]
991 ms ± 45.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# With broadcasting
In [9]: %timeit manhattan_nan(a)
8.43 ms ± 49.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
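To reproduce the comparison outside IPython, a standalone script along the following lines should work. It is a rough sketch using the standard `timeit` module; the repeat counts are arbitrary choices of mine, and absolute timings will vary by machine. It also checks that both approaches agree on this data before timing them.

```python
import timeit
import numpy as np
from scipy.spatial.distance import pdist, squareform

def nan_manhattan(X, Y):
    nan_diff = np.absolute(X - Y)
    length = nan_diff.size
    return np.nansum(nan_diff) * length / (length - np.isnan(nan_diff).sum())

def manhattan_nan(a):
    s = np.nansum(np.abs(a[:, None, :] - a), axis=-1)
    k = (~np.isnan(a)).sum(1)
    return s * (a.shape[1] / np.minimum.outer(k, k))

sample = np.array([[1.3, 2.7, 0.5, np.nan, np.nan],
                   [2.0, 8.9, 2.5, 5.6, 3.5],
                   [0.6, 3.4, 9.5, 7.4, np.nan]])
tall = np.repeat(sample, 100, axis=0)  # shape (300, 5)

# Check correctness before comparing speed.
agree = np.allclose(manhattan_nan(tall), squareform(pdist(tall, nan_manhattan)))
print("results agree:", agree)

# Time one pdist call and the average of ten broadcasting calls.
t_pdist = timeit.timeit(lambda: pdist(tall, nan_manhattan), number=1)
t_bcast = timeit.timeit(lambda: manhattan_nan(tall), number=10) / 10
print(f"pdist with callable: {t_pdist:.4f} s")
print(f"broadcasting:        {t_bcast:.4f} s")
```

Keep in mind that the broadcasting version materializes an intermediate array of shape (n, n, d), so its memory use grows quadratically with the number of rows; for very tall arrays it may be necessary to process the rows in chunks.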