How to process a dataset document (CSV) with the WIDF algorithm


I have a problem with my program. I built a system in Python that uses the WIDF algorithm to classify documents (CSV).

This is the WIDF algorithm:

import pprint

class WIdf():

    def __init__(self):
        self.total_tf = 0
        self.total_weight = 0
        self.document = []
        self.query = ''
        self.corpus = {}

    def transform(self, q, document):
        # Count how often the query word occurs in each document
        self.query = q
        self.document = document
        self.total_tf = 0   # reset so repeated calls do not accumulate
        self.corpus = {}
        for index, item in enumerate(self.document):
            words = item.split(' ')
            tf = 0
            for word in words:
                if self.query.lower() == word.lower():
                    tf += 1
            self.total_tf += tf
            self.corpus[index] = {"tf": tf}
        return self

    def weight(self):
        # Weight of a document = its tf divided by the tf summed over all documents
        for key, value in self.corpus.items():
            # Guard against division by zero when the query never occurs
            weight = value['tf'] / self.total_tf if self.total_tf else 0
            self.corpus[key]['weight'] = weight
            self.total_weight += weight

    def get_weight(self):
        self.total_weight = 0
        self.weight()
        return self.corpus

    def weight_average(self):  # my own addition
        self.total_weight = 0
        self.weight()
        return self.total_weight / len(self.document)
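
To make the weighting step easier to follow, here is a minimal standalone sketch of what the class above computes, assuming the weight of a document is simply its term frequency divided by the term frequency summed over all documents. The function name `widf_weights` and the toy documents are made up for illustration and are not part of the original code. It uses `split()` rather than `split(' ')`, which only differs when a document contains repeated spaces.

```python
def widf_weights(query, documents):
    """Share of the query's total occurrences that falls in each document."""
    q = query.lower()
    # Term frequency per document: exact, case-insensitive token matches
    tfs = [sum(1 for w in doc.split() if w.lower() == q) for doc in documents]
    total = sum(tfs)
    # If the word never occurs, every weight is 0 instead of a division error
    return [tf / total if total else 0.0 for tf in tfs]

# Toy documents (hypothetical data)
docs = ["cinta hati cinta", "hati cinta", "aku setia"]
weights = widf_weights("cinta", docs)
print(weights)                    # tf = [2, 1, 0], total = 3
print(sum(weights) / len(docs))   # the quantity weight_average() returns
```

One consequence worth noting: whenever the query occurs at least once, the weights add up to 1 (up to floating-point rounding), so the average printed on the last line is just 1 divided by the number of documents and does not depend on the query word.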

This is the program that processes the text dataset:

import pprint
from widf import WIdf

print("1")
# This is the dataset; I want to convert it to a document (CSV)
texts = ['hatiNN buahNN anugerahNN cintaNN buahNN deritaVB pendamNN hasratNN cobaVB kenalVB bedaJJ takNEG kanMD mungkinMD satuCD jauhJJ dasarNN hatiNN semuaCD sulitJJ akhirNN cintaNN takNEG mampuJJ rubahNN sifatNN bosanNN sikapNN slaluNN abaiNN semuaCD buatIN diriNN cintaNN takNEG kanMD akhirNN hubungNN cintaNN sangatRB untungNN hidupNN','akuVB takNEG mampuJJ sakitNN akuVB takNEG sanggupNN akuVB takNEG mampuJJ sakitNN akuVB takNEG sanggupNN takNEG mungkinMD cintaNN hatiNN tlahNN milikNN takNEG mungkinMD milikNN sepenuhJJ hatiNN akuVB setiaJJ akuVB hargaNN tulusJJ cintaNN milikNN takNEG mungkinMD cintaNN hatiNN tlahNN milikNN takNEG mungkinMD milikNN sepenuhJJ hatiNN akuVB setiaJJ takNEG mungkinMD cintaNN hatiNN tlahNN milikNN takNEG mungkinMD milikNN sepenuhJJ hatiNN akuVB setiaJJ akuVB setiaJJ',]

q = 'cintaNN'  # the word to search for and weight

print('')
print('Pembobotan W-IDF')
widf = WIdf().transform(q=q,document=texts)
print("Bobot rata-rata: " + str(widf.weight_average()))
pprint.pprint(widf.get_weight())
print("+---------------------------------+")
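# NOTE: `tfidf` and `model` are not defined anywhere in the posted code,
# so the classification step below is left commented out.
# text_features = tfidf.transform(texts)
# predictions = model.predict(text_features)
# for text, predicted in zip(texts, predictions):
#     print('"{}"'.format(text))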

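This program searches for the weight of a word within the sentences of a dataset. So what I want to do here is take a dataset that was originally plain text and process it as a document (CSV).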
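
One possible way to get the documents out of a CSV file is sketched below with the standard-library `csv` module. It is only an illustration under assumptions that the post does not confirm: the file has a header row, and the sentences live in a column called `text`. The helper name `load_texts`, the column name, and the file path mentioned in the comment are all hypothetical.

```python
import csv
import io

def load_texts(csv_file, column="text"):
    """Return the non-empty values of one column from an open CSV file."""
    reader = csv.DictReader(csv_file)  # first row is treated as the header
    return [row[column] for row in reader if row.get(column)]

# In-memory stand-in for a real file. With a file on disk, pass
# open("dataset.csv", newline="", encoding="utf-8") instead (hypothetical path).
sample = io.StringIO(
    "id,text\n"
    "1,hatiNN buahNN cintaNN\n"
    "2,akuVB takNEG cintaNN cintaNN\n"
)
texts = load_texts(sample)
print(texts)
```

The resulting list has the same shape as the hard-coded `texts` list in the program above, so it could be passed unchanged as `document=texts` to `WIdf().transform(...)`.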
