Getting the feature name words for RandomForestClassifier - text classification with spaCy

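I am trying to classify patents as medical-technology related or not, based on their combined text, and the accuracy I get is very high. Because of that, I would like to see which words matter most for the classification.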

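I followed this tutorial to set up the spaCy model, but used RandomForestClassifier instead of LinearSVC, because LinearSVC does not support predict_proba, which is more relevant to my problem. Here is my code: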

def printNMostinformative(vectorizer,clf,N):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0],feature_names))
    topClass1 = coefs_with_fns[:N]
    topClass2 = coefs_with_fns[:-(N + 1):-1]
    print("Class 1 best: ")
    for feat in topClass1:
        print(feat)
    print("Class 2 best: ")
    for feat in topClass2:
        print(feat)

class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self,*args,**kwargs):
        super(RandomForestClassifierWithCoef,self).fit(*args,**kwargs)
        self.coef_ = self.feature_importances_

vectorizer = CountVectorizer(tokenizer=tokenizeText,ngram_range=(1,1))
clf = RandomForestClassifierWithCoef(n_estimators=1000,random_state=0)
pipe = Pipeline([('cleanText',CleanTextTransformer()),('vectorizer',vectorizer),('clf',clf)])

# data
train1 = train['Whole_text'].tolist()
labelsTrain1 = train['Med_area'].tolist()

test1 = test['Whole_text'].tolist()
labelsTest1 = test['Med_area'].tolist()
# train
pipe.fit(train1,labelsTrain1)

# test
preds = pipe.predict(test1)
print("accuracy:",accuracy_score(labelsTest1,preds))
print("Top 10 features used to predict: ")
printNMostinformative(vectorizer,clf,10)

pipe = Pipeline([('cleanText',CleanTextTransformer()),('vectorizer',vectorizer)])
transform = pipe.fit_transform(train1,labelsTrain1)
vocab = vectorizer.get_feature_names()

for i in range(len(train1)):
    s = ""
    indexIntoVocab = transform.indices[transform.indptr[i]:transform.indptr[i+1]]
    numOccurences = transform.data[transform.indptr[i]:transform.indptr[i+1]]
    for idx,num in zip(indexIntoVocab,numOccurences):
        s += str((vocab[idx],num))

I keep getting this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-23-4e74698a75fc> in <module>
     33 print("accuracy:",accuracy_score(labelsTest1,preds))
     34 print("Top 10 features used to predict: ")
---> 35 printNMostinformative(vectorizer,clf,10)
     36 
     37 pipe = Pipeline([('cleanText',CleanTextTransformer()),('vectorizer',vectorizer)])

<ipython-input-23-4e74698a75fc> in printNMostinformative(vectorizer,clf,N)
      1 def printNMostinformative(vectorizer,clf,N):
      2     feature_names = vectorizer.get_feature_names()
----> 3     coefs_with_fns = sorted(zip(clf.coef_[0],feature_names))
      4     topClass1 = coefs_with_fns[:N]
      5     topClass2 = coefs_with_fns[:-(N + 1):-1]

TypeError: zip argument #1 must support iteration
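
A likely cause, sketched with stand-in values rather than the real model: LinearSVC's coef_ is 2-D (one row per class, or a single row for a binary problem), so coef_[0] is a whole row of coefficients. RandomForestClassifier's feature_importances_ is 1-D with one score per feature, so after self.coef_ = self.feature_importances_ the expression clf.coef_[0] is a single float, and zip cannot iterate over it. The list and the vocabulary below are hypothetical, and the exact wording of the error message depends on the Python version.

```python
# Stand-in for clf.feature_importances_: one score per feature (1-D).
feature_importances = [0.1, 0.7, 0.2]
feature_names = ["catheter", "implant", "piston"]  # hypothetical vocabulary

first = feature_importances[0]  # a single float, not a row of coefficients

try:
    sorted(zip(first, feature_names))
    failed = False
except TypeError as err:
    failed = True
    print("TypeError:", err)

# Zipping the whole 1-D sequence works and pairs each score with its word.
ranked = sorted(zip(feature_importances, feature_names), reverse=True)
print(ranked)
```

Under this reading, dropping the [0] index is enough to make the zip call succeed.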

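I have 2 questions:

  1. How can I fix this and see the most important words (features) for each class?
  2. If I use predict_proba and feed it into roc_auc_score, is there a way to see this as well?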
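
One possible way to approach both questions, as a minimal sketch under stated assumptions. The toy corpus, the labels and the helper top_n_features are hypothetical. The default CountVectorizer tokenizer replaces tokenizeText and CleanTextTransformer from the tutorial, which are not shown here. Note that feature_importances_ gives one overall score per word, not a score per class, so this ranks words by overall importance only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the patent texts (1 = medical).
texts = [
    "implant surgical catheter", "surgical stent implant",
    "catheter stent surgical", "implant catheter stent",
    "engine gearbox piston", "piston valve engine",
    "gearbox valve piston", "engine valve gearbox",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipe.fit(texts, labels)


def top_n_features(pipeline, n):
    """Return the n (importance, word) pairs with the highest importance."""
    vec = pipeline.named_steps["vectorizer"]
    clf = pipeline.named_steps["clf"]
    # vocabulary_ maps word -> column index; order the words by column index.
    words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
    # feature_importances_ is 1-D, so zip over the whole array, not [0].
    pairs = sorted(zip(clf.feature_importances_, words), reverse=True)
    return pairs[:n]


for importance, word in top_n_features(pipe, 5):
    print(f"{word}: {importance:.3f}")

# predict_proba columns follow clf.classes_; pick the column of class 1.
clf = pipe.named_steps["clf"]
positive_col = list(clf.classes_).index(1)
scores = pipe.predict_proba(texts)[:, positive_col]
auc = roc_auc_score(labels, scores)
print("ROC AUC (training data, illustration only):", auc)
```

With the real data, the same two steps would use the fitted pipe from the question and score test1 against labelsTest1 instead of the training texts.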
