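I have a DataFrame of shape (1510, 1399). The columns represent products, and the rows hold the value (0 or 1) that a user assigned to a given product. How can I compute the jaccard_similarity_score?
I created a placeholder DataFrame that lists product against product:
    data_ibs = pd.DataFrame(index=data_g.columns, columns=data_g.columns)
I don't know how to iterate through data_ibs to compute the similarities.
    for i in range(0, len(data_ibs.columns)):
        # Loop through the columns for each column
        for j in range(0, len(data_ibs.columns)):
            .........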
Solution
Short, vectorized (fast) answer:
Use the 'hamming' metric with scikit-learn's pairwise distances:
    from sklearn.metrics.pairwise import pairwise_distances

    jac_sim = 1 - pairwise_distances(df.T, metric="hamming")
    # optionally convert it to a DataFrame
    jac_sim = pd.DataFrame(jac_sim, index=df.columns, columns=df.columns)
Explanation:
Assume this is your dataset:
    import pandas as pd
    import numpy as np

    np.random.seed(0)
    df = pd.DataFrame(np.random.binomial(1, 0.5, size=(100, 5)), columns=list('ABCDE'))
    print(df.head())

       A  B  C  D  E
    0  1  1  1  1  0
    1  1  0  1  1  0
    2  1  1  1  1  0
    3  0  0  1  1  1
    4  1  1  0  1  0
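Using sklearn's jaccard_similarity_score, the similarity between columns A and B is:
    from sklearn.metrics import jaccard_similarity_score

    print(jaccard_similarity_score(df['A'], df['B']))

    0.43
This is the number of rows that hold the same value in both columns, divided by the total number of rows (100).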
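To make the "same value" reading concrete, here is a minimal sketch that recomputes that figure with pandas alone, assuming the same seeded df as in the example above. It may also help on newer scikit-learn releases, since jaccard_similarity_score was deprecated in 0.21 and removed in 0.23. The names n_equal and match_fraction are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Rebuild the example dataset from the answer (same seed, same shape).
np.random.seed(0)
df = pd.DataFrame(np.random.binomial(1, 0.5, size=(100, 5)), columns=list("ABCDE"))

# Count the rows where A and B hold the same value (both 0 or both 1),
# then divide by the total number of rows.
n_equal = int((df["A"] == df["B"]).sum())
match_fraction = n_equal / len(df)

print(n_equal, match_fraction)
```

Rows where both columns are 0 count as matches here, which is the behavior the rest of the answer relies on.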
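As far as I know, there is no pairwise version of jaccard_similarity_score, but there are pairwise versions of distances.
However, SciPy defines Jaccard distance as follows:
Given two vectors, u and v, the Jaccard distance is the proportion of those elements u[i] and v[i] that disagree where at least one of them is non-zero.
So it excludes the rows where both columns are 0, whereas jaccard_similarity_score does not. Hamming distance, on the other hand, is consistent with the similarity definition: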
The proportion of those vector elements between two n-vectors u and v which disagree.
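The difference between the two quoted definitions is easy to see on a small hand-made pair of vectors. The sketch below implements both definitions directly in NumPy rather than calling SciPy; u and v are made-up values chosen so that one position is 0 in both vectors.

```python
import numpy as np

# Two illustrative binary vectors; positions 2 and 4 are 0 in both.
u = np.array([1, 0, 0, 1, 0])
v = np.array([1, 1, 0, 0, 0])

disagree = u != v                      # True at positions 1 and 3
either_nonzero = (u != 0) | (v != 0)   # True at positions 0, 1 and 3

# Hamming: share of ALL positions that disagree -> 2 / 5
hamming = disagree.mean()

# Jaccard (per the quoted definition): share of disagreeing positions
# among those where at least one vector is non-zero -> 2 / 3
jaccard = disagree[either_nonzero].mean()

print(hamming, jaccard)
```

The two shared zeros enlarge the Hamming denominator but are ignored by the Jaccard distance, which is why the two values differ.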
So if you want to compute jaccard_similarity_score, you can use 1 - Hamming:
    from sklearn.metrics.pairwise import pairwise_distances

    print(1 - pairwise_distances(df.T, metric="hamming"))

    array([[ 1.  , 0.43, 0.61, 0.55, 0.46],
           [ 0.43, 1.  , 0.52, 0.56, 0.49],
           [ 0.61, 0.52, 1.  , 0.48, 0.53],
           [ 0.55, 0.56, 0.48, 1.  , 0.49],
           [ 0.46, 0.49, 0.53, 0.49, 1.  ]])
In DataFrame format:
    jac_sim = 1 - pairwise_distances(df.T, metric="hamming")
    jac_sim = pd.DataFrame(jac_sim, index=df.columns, columns=df.columns)
    # jac_sim = np.triu(jac_sim) to set the lower triangle to zero
    # jac_sim = np.tril(jac_sim) to set the upper triangle to zero

          A     B     C     D     E
    A  1.00  0.43  0.61  0.55  0.46
    B  0.43  1.00  0.52  0.56  0.49
    C  0.61  0.52  1.00  0.48  0.53
    D  0.55  0.56  0.48  1.00  0.49
    E  0.46  0.49  0.53  0.49  1.00
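You can do the same by iterating over combinations of columns, but it will be much slower.
    import itertools

    # Start from all ones so the diagonal (self-similarity) is already 1.
    sim_df = pd.DataFrame(np.ones((5, 5)), index=df.columns, columns=df.columns)
    for col_pair in itertools.combinations(df.columns, 2):
        # Fill both (row, col) and (col, row) to keep the matrix symmetric.
        sim_df.loc[col_pair] = sim_df.loc[tuple(reversed(col_pair))] = jaccard_similarity_score(df[col_pair[0]], df[col_pair[1]])
    print(sim_df)

          A     B     C     D     E
    A  1.00  0.43  0.61  0.55  0.46
    B  0.43  1.00  0.52  0.56  0.49
    C  0.61  0.52  1.00  0.48  0.53
    D  0.55  0.56  0.48  1.00  0.49
    E  0.46  0.49  0.53  0.49  1.00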
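If scikit-learn is not available, the same matching-fraction matrix can probably be obtained with plain NumPy matrix products. This is only a sketch under the assumption that every entry is strictly 0 or 1; it is not the method used in the answer above, and the names both_one, both_zero and match_df are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the example dataset from the answer (same seed, same shape).
np.random.seed(0)
df = pd.DataFrame(np.random.binomial(1, 0.5, size=(100, 5)), columns=list("ABCDE"))

X = df.to_numpy()

# For 0/1 data, X.T @ X counts the rows where both columns are 1,
# and (1 - X).T @ (1 - X) counts the rows where both columns are 0.
both_one = X.T @ X
both_zero = (1 - X).T @ (1 - X)

# Matching rows divided by the total number of rows, for every column pair.
match = (both_one + both_zero) / len(df)
match_df = pd.DataFrame(match, index=df.columns, columns=df.columns)

print(match_df.round(2))
```

The result is symmetric with ones on the diagonal, so np.triu or np.tril can be applied to it in the same way as to the pairwise_distances result.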