Understanding Gensim Doc2Vec ranking

I am using gensim 4.0.1 and following tutorials 1 and 2:

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Lowercase each text and split it into tokens
texts = [t.lower().split() for t in texts]

# Tag each document with its integer index
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(texts)]
model = Doc2Vec(documents, epochs=50, vector_size=5, window=2, min_count=2, workers=4)

new_vector = model.infer_vector("human machine interface".split())

# Rank the training documents by similarity to the inferred vector (ranks start at 1)
for rank, (doc_id, score) in enumerate(model.dv.most_similar_cosmul(positive=[new_vector]), start=1):
    print('{}. {:.5f} [{}] {}'.format(rank, score, doc_id, ' '.join(documents[doc_id].words)))


1. 0.56613 [7] graph minors iv widths of trees and well quasi ordering
2. 0.55941 [6] the intersection graph of paths in trees
3. 0.55061 [2] the eps user interface management system
4. 0.54981 [1] a survey of user opinion of computer system response time
5. 0.52249 [4] relation of user perceived response time to error measurement
6. 0.52240 [8] graph minors a survey
7. 0.49214 [0] human machine interface for lab abc computer applications
8. 0.49016 [3] system and human system engineering testing of eps
9. 0.47899 [5] the generation of random binary unordered trees

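Why is document [0], which contains "human machine interface", ranked so low (7th)? Is this a result of semantic generalization, or does the model need tuning? Is there a tutorial that uses a larger corpus, so that I can get reproducible results?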

Answer

The problem is the same as in my earlier answer to a similar question:

https://stackoverflow.com/a/66976706/130288

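Doc2Vec needs far more data to start working. 9 texts, with maybe 55 words in total and roughly half that many unique words, is far too small to show any interesting results with this algorithm.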
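To get a feel for how small this corpus is, here is a minimal sketch that counts its tokens using only the standard library, without Gensim. The `c >= 2` threshold mirrors the question's `min_count=2` setting; the variable names are illustrative, and the totals depend on the whitespace tokenization used here, so treat the figures quoted above as rough estimates.

```python
from collections import Counter

texts = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Same tokenization as in the question: lowercase, split on whitespace.
tokenized = [t.lower().split() for t in texts]
counts = Counter(word for doc in tokenized for word in doc)

total_tokens = sum(counts.values())
unique_tokens = len(counts)

# Words seen fewer than 2 times would be discarded by min_count=2.
kept = {word for word, c in counts.items() if c >= 2}

# Which words of the query would survive that threshold?
query = "human machine interface".split()
known_query_words = [w for w in query if w in kept]

print("total tokens:", total_tokens)
print("unique tokens:", unique_tokens)
print("tokens kept with min_count=2:", len(kept))
print("query words still in the vocabulary:", known_query_words)
```

Words that fall below `min_count` never enter the model's vocabulary, so a query word that was dropped this way contributes nothing to the vector returned by `infer_vector`.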

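Some of Gensim's Doc2Vec-specific test cases and tutorials manage to squeeze a few vaguely understandable similarities out of a test dataset (from the file lee_background.cor) of 300 documents, each a few hundred words long. That is tens of thousands of words in total, several thousand of them unique. Even so, it requires reducing the vector dimensionality and increasing the number of epochs, and the results are still very weak.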

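If you want to see meaningful results from Doc2Vec, you should aim for tens of thousands of documents, ideally with dozens or hundreds of words each.

Anything short of that will be disappointing, and will not be representative of the kinds of tasks the algorithm was designed to handle.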

There is a tutorial that uses a larger movie-review dataset (100K documents), which was also used in the original "Paragraph Vector" paper:

https://radimrehurek.com/gensim/auto_examples/howtos/run_doc2vec_imdb.html#sphx-glr-auto-examples-howtos-run-doc2vec-imdb-py

There is also a tutorial based on Wikipedia (millions of documents), which may need some fixes to work today:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
