How to compute pairwise similarity between documents from two different collections
I am trying to compute cosine_similarity between documents from two different datasets. Each collection has 30 documents, and I want to match similar documents between documents
and documents2
.
My approach so far looks like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import sys
import glob
import codecs
from collections import defaultdict
from collections import Counter
from nltk import word_tokenize
import nltk
import re
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from sklearn.metrics.pairwise import cosine_similarity
from contextlib import ExitStack
#max_df = 29 set to ignore terms that appear in more than 29 documents
#TFIDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=29)
#Create list of documents to work with
path = "C:\\Users\\path\\Desktop\\research\\dataset\\1"
text_files = [f for f in os.listdir(path) if f.endswith('.txt')]
documents = [os.path.join(path,name) for name in text_files]
with ExitStack() as stack:
    files = [stack.enter_context(open(filename, encoding="utf-8")).read() for filename in documents]
X = tfidf_vectorizer.fit_transform(files)
path2 = "C:\\Users\\path\\Desktop\\research\\dataset\\2"
text_files2 = [f for f in os.listdir(path2) if f.endswith('.txt')]
documents2 = [os.path.join(path2,name) for name in text_files2]
with ExitStack() as stack:
    files2 = [stack.enter_context(open(filename, encoding="utf-8")).read() for filename in documents2]
X2 = tfidf_vectorizer.fit_transform(files2)
#X = X.reshape(-1,1)
#X2 = X2.reshape(-1,1)
sm = cosine_similarity(X,X2)
I get the following error: ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 5068 while Y.shape[1] == 4479
.
If I uncomment the reshape
statements, I instead get numpy.core._exceptions.MemoryError: Unable to allocate 152. GiB for an array with shape (152040,134370) and data type float64
.
Any ideas?
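The dimension mismatch likely comes from calling fit_transform twice: each call learns a fresh vocabulary, so X and X2 end up with different numbers of columns (5068 vs. 4479 features). One way to keep the column spaces aligned is to fit the vectorizer once and reuse it with transform on the second collection. A minimal sketch of that idea, using small in-memory toy corpora in place of the two folders of .txt files (the strings below are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the two document collections (hypothetical data)
files = ["the cat sat on the mat", "dogs are great pets"]
files2 = ["a cat on a mat", "pets such as dogs"]

tfidf_vectorizer = TfidfVectorizer()

# Learn the vocabulary from the first collection only...
X = tfidf_vectorizer.fit_transform(files)
# ...then project the second collection into the SAME feature space,
# so both matrices have an identical number of columns.
X2 = tfidf_vectorizer.transform(files2)

# Now the pairwise similarity matrix is (n_docs_1, n_docs_2)
sm = cosine_similarity(X, X2)
print(sm.shape)
```

An alternative is to fit on the concatenation of both collections (tfidf_vectorizer.fit(files + files2)) so that terms appearing only in the second set are not dropped, and then transform each set separately. The reshape(-1, 1) calls should stay commented out: they flatten every TF-IDF weight into its own one-dimensional "document", which is why cosine_similarity then tries to allocate a 152040 × 134370 matrix.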