
TFIDF vectorizer implementation does not match sklearn


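I am implementing a TFIDF vectorizer with fit and transform functions. The values it produces should match sklearn's, but they don't. I have tried my best but can't figure out why.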

My code

    import math
    from collections import Counter

    from scipy.sparse import csr_matrix
    from sklearn.preprocessing import normalize

    # FIT METHOD
    # Input : set of documents ; Output : vocabulary mapping each unique word to a column index
    def fit(corpus):
        unique_words = []
        for row in corpus:
            for word in row.split(" "):
                if len(word) >= 2 and word not in unique_words:
                    unique_words.append(word)  # add each unique word of length >= 2 to the list
        unique_words.sort()
        vocab = {j: i for i, j in enumerate(unique_words)}
        return vocab

    # Counts the number of documents in the corpus that contain the given word
    def IDF(corpus, word):
        count = 0
        for row in corpus:
            if word in row.split():
                count = count + 1
        return count

    # TRANSFORM METHOD
    # Input : set of documents, vocab from fit() ; Output : TF-IDF matrix
    def transform(corpus, vocab):
        rows = []
        columns = []
        values = []
        for idx, row in enumerate(corpus):
            word_freq = dict(Counter(row.split()))
            for word, freq in word_freq.items():
                if len(word) < 2:
                    continue
                # check whether the word is in the vocabulary built by fit();
                # dict.get() returns the column index, or -1 if the key doesn't exist
                col_index = vocab.get(word, -1)
                if col_index != -1:
                    rows.append(idx)            # index of the document
                    columns.append(col_index)   # dimension (column) of the word
                    # IDF for each word; the intent is log(total no. of documents /
                    # no. of documents containing the word), with the count from IDF()
                    idf_value = 1 + (math.log(1 + len(corpus)/1 + IDF(corpus,word)))
                    # TF for each word: freq of the word / total words in the document
                    tf_value = freq / len(row.split())
                    val = tf_value * idf_value  # TF * IDF
                    values.append(val)
                    print(idf_value)
        return normalize(csr_matrix((values, (rows, columns)), shape=(len(corpus), len(vocab))), norm='l2')

My code's output

 (0,1) 0.440693994585412
 (0,2) 0.42158452878722275
 (0,3) 0.4575497379970868
 (0,6) 0.4575497379970868
 (0,8) 0.4575497379970868

Expected output

 (0,8) 0.38408524091481483
 (0,6) 0.38408524091481483
 (0,3) 0.38408524091481483
 (0,2) 0.5802858236844359
 (0,1) 0.46979138557992045
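
A likely cause of the gap is operator precedence in the `idf_value` line. Python evaluates `1 + len(corpus)/1 + IDF(corpus,word)` as `1 + N + df`, whereas sklearn's default smoothed IDF is `1 + ln((1 + N) / (1 + df))`, with raw term counts as TF and L2 normalization per row. Dividing TF by the document length scales a whole row by a constant, so L2 normalization cancels it and it should not be the source of the difference. Below is a minimal pure-Python sketch of that formula. The corpus is an assumption (the question does not show it), chosen because it yields the same column indices as the outputs above. The helper names `doc_freq` and `smooth_idf` are hypothetical.

```python
import math
from collections import Counter

# Assumed corpus: NOT given in the question, used here only for illustration.
corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

def fit(corpus):
    # Vocabulary of unique words with length >= 2, sorted alphabetically.
    words = {w for row in corpus for w in row.split() if len(w) >= 2}
    return {w: i for i, w in enumerate(sorted(words))}

def doc_freq(corpus, vocab):
    # Number of documents containing each vocabulary word.
    df = Counter()
    for row in corpus:
        df.update(w for w in set(row.split()) if w in vocab)
    return df

def smooth_idf(n_docs, df):
    # Explicit parentheses around numerator and denominator:
    # 1 + ln((1 + N) / (1 + df))
    return 1.0 + math.log((1 + n_docs) / (1 + df))

def transform(corpus, vocab):
    # Returns one {column: weight} dict per document, L2-normalized.
    df = doc_freq(corpus, vocab)
    n_docs = len(corpus)
    result = []
    for row in corpus:
        counts = Counter(w for w in row.split() if w in vocab)
        weights = {vocab[w]: c * smooth_idf(n_docs, df[w]) for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        result.append({j: v / norm for j, v in weights.items()} if norm else {})
    return result

vocab = fit(corpus)
first_row = transform(corpus, vocab)[0]
for col in sorted(first_row, reverse=True):
    print((0, col), first_row[col])
```

Under the assumed corpus, the row-0 weights from this sketch should agree with the expected output above. In the original `transform`, the equivalent change would be to write the argument of `math.log` as `(1 + len(corpus)) / (1 + IDF(corpus, word))`.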
