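How to fix a TF-IDF vectorizer implementation that does not match sklearn
I am implementing a TF-IDF vectorizer with fit and transform functions. The values it produces should match those from sklearn, but they do not. I have tried my best and cannot figure out why.
My code: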
import math
from collections import Counter

from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# FIT METHOD
# Input : set of documents ; Output : vocabulary mapping each unique word to a column index
def fit(corpus):
    unique_words = []
    for row in corpus:
        for word in row.split(" "):
            if len(word) >= 2 and word not in unique_words:
                unique_words.append(word)  # Add each unique word of length >= 2 to the list
    unique_words.sort()
    vocab = {j: i for i, j in enumerate(unique_words)}
    return vocab

# Returns the number of documents that contain the word (its document frequency)
def IDF(corpus, word):
    count = 0
    for row in corpus:
        if word in row.split():
            count = count + 1
    return count

# TRANSFORM METHOD
# Input : set of documents, vocab from fit() ; Output : TF-IDF matrix
def transform(corpus, vocab):
    rows = []
    columns = []
    values = []
    for idx, row in enumerate(corpus):
        word_freq = dict(Counter(row.split()))
        for word, freq in word_freq.items():
            if len(word) < 2:
                continue
            # Check whether the word is in the vocabulary built by fit().
            # dict.get() returns the column index, or -1 if the key does not exist.
            col_index = vocab.get(word, -1)  # column (dimension) of the word
            if col_index != -1:
                # Store the index of the document
                rows.append(idx)
                # Store the column index of the word
                columns.append(col_index)
                # TF: frequency of the word / total words in the document
                # Intended IDF: 1 + log((1 + number of documents) / (1 + number of documents containing the word))
                # Then compute TF * IDF
                idf_value = 1 + (math.log(1 + len(corpus)/1 + IDF(corpus, word)))
                tf_value = (freq / len(row.split()))
                val = (tf_value) * (idf_value)
                values.append(val)
                print(idf_value)  # debug output
    return normalize(csr_matrix((values, (rows, columns)), shape=(len(corpus), len(vocab))), norm='l2')
My output:
(0,1) 0.440693994585412
(0,2) 0.42158452878722275
(0,3) 0.4575497379970868
(0,6) 0.4575497379970868
(0,8) 0.4575497379970868
Expected output:
(0,8) 0.38408524091481483
(0,6) 0.38408524091481483
(0,3) 0.38408524091481483
(0,2) 0.5802858236844359
(0,1) 0.46979138557992045
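
One likely cause is operator precedence in the idf_value line. As written, `1 + len(corpus)/1 + IDF(corpus,word)` evaluates to 1 + N + df, so the code takes log(1 + N + df) rather than log((1 + N) / (1 + df)), which is the smoothed IDF that sklearn applies by default (smooth_idf=True). The sketch below is a minimal pure-Python version with the parentheses in place. The corpus is an assumption, since the question does not show it: it is the four-sentence example from the scikit-learn documentation, chosen because it appears to reproduce the expected values listed above. The helper name smoothed_idf and the dict-per-row output are illustrative choices as well, not part of the original code.

```python
import math
from collections import Counter

# Assumed corpus (not shown in the question): the four-sentence example
# used in the scikit-learn text feature extraction documentation.
corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

def fit(corpus):
    """Map each token of length >= 2 to a column index, in sorted order."""
    words = {w for row in corpus for w in row.split() if len(w) >= 2}
    return {w: i for i, w in enumerate(sorted(words))}

def smoothed_idf(corpus, vocab):
    """idf(t) = 1 + ln((1 + N) / (1 + df(t))); note the parentheses."""
    n_docs = len(corpus)
    # Count each word once per document to get its document frequency.
    df = Counter(w for row in corpus for w in set(row.split()) if w in vocab)
    return {w: 1 + math.log((1 + n_docs) / (1 + df[w])) for w in vocab}

def transform(corpus, vocab):
    """Return one {column: weight} dict per document, L2-normalized per row."""
    idf = smoothed_idf(corpus, vocab)
    result = []
    for row in corpus:
        counts = Counter(w for w in row.split() if w in vocab)
        # Raw term count times IDF, before normalization.
        weights = {vocab[w]: c * idf[w] for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        result.append({col: v / norm for col, v in weights.items()} if norm else {})
    return result

vocab = fit(corpus)
tfidf_rows = transform(corpus, vocab)
for col, val in sorted(tfidf_rows[0].items()):
    print((0, col), val)
```

Dividing the term count by the document length, as the original transform does, should not change the result, because L2 normalization cancels any constant factor within a row. In the original code the same fix would be to write the line as idf_value = 1 + math.log((1 + len(corpus)) / (1 + IDF(corpus, word))).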