How to interpret Gensim Doc2Vec similarity rankings
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
texts = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
texts = [t.lower().split() for t in texts]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(texts)]
model = Doc2Vec(documents, epochs=50, vector_size=5, window=2, min_count=2, workers=4)
new_vector = model.infer_vector("human machine interface".split())
for rank, (doc_id, score) in enumerate(model.dv.most_similar_cosmul(positive=[new_vector])):
    print('{}. {:.5f} [{}] {}'.format(rank, score, doc_id, ' '.join(documents[doc_id].words)))
1. 0.56613 [7] graph minors iv widths of trees and well quasi ordering
2. 0.55941 [6] the intersection graph of paths in trees
3. 0.55061 [2] the eps user interface management system
4. 0.54981 [1] a survey of user opinion of computer system response time
5. 0.52249 [4] relation of user perceived response time to error measurement
6. 0.52240 [8] graph minors a survey
7. 0.49214 [0] human machine interface for lab abc computer applications
8. 0.49016 [3] system and human system engineering testing of eps
9. 0.47899 [5] the generation of random binary unordered trees
Why does document [0], which contains "human machine interface", rank so low (7th)? Is this a result of semantic generalization, or does the model need tuning? Could I use a tutorial with a larger corpus to get reproducible results?
Solution
The problem is the same as in my earlier answer to a similar question:
https://stackoverflow.com/a/66976706/130288
Doc2Vec needs much more data before it starts to work. Nine texts, totaling perhaps 55 words of which roughly half are unique, is far too small for this algorithm to show any interesting results.
Some of Gensim's Doc2Vec-specific test cases and tutorials manage to squeeze some vaguely sensible similarities out of a test dataset of 300 documents (from the file lee_background.cor), each a few hundred words long: tens of thousands of words in total, several thousand of them unique. But even that requires reduced vector dimensionality and increased epochs, and the results are still weak.
If you want to see meaningful results from Doc2Vec, you should aim for tens of thousands of documents, ideally with dozens or hundreds of words each. Anything less will be disappointing, and unrepresentative of the kinds of tasks the algorithm was designed to handle.
There is a tutorial using a larger movie-review dataset (100K documents), which was also used in the original "Paragraph Vector" paper:
There is a tutorial based on Wikipedia (millions of documents), which may now need some fixes to run, at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb