Getting the feature names (words) for RandomForestClassifier: text classification with spaCy

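I am trying to classify some patents as medical-technology related or not, based on their combined text, and my accuracy is very high. Because of that, I would like to see which words matter most for the classification.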

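I followed this tutorial for the spaCy part, but used RandomForestClassifier instead of LinearSVC, because LinearSVC does not support predict_proba, which is more relevant to my problem. Here is my code: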

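# Imports needed by the code below. tokenizeText and CleanTextTransformer come
# from the tutorial; train and test are defined elsewhere and not shown here.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

def printNMostinformative(vectorizer,clf,N):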
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0],feature_names))
    topClass1 = coefs_with_fns[:N]
    topClass2 = coefs_with_fns[:-(N + 1):-1]
    print("Class 1 best: ")
    for feat in topClass1:
        print(feat)
    print("Class 2 best: ")
    for feat in topClass2:
        print(feat)

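class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self,*args,**kwargs):
        super(RandomForestClassifierWithCoef,self).fit(*args,**kwargs)
        # expose the feature importances under the name coef_
        self.coef_ = self.feature_importances_
        # scikit-learn convention: fit returns the estimator itself
        return self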

vectorizer = CountVectorizer(tokenizer=tokenizeText,ngram_range=(1,1))
clf = RandomForestClassifierWithCoef(n_estimators=1000,random_state=0)
pipe = Pipeline([('cleanText',CleanTextTransformer()),('vectorizer',vectorizer),('clf',clf)])

# data
train1 = train['Whole_text'].tolist()
labelsTrain1 = train['Med_area'].tolist()

test1 = test['Whole_text'].tolist()
labelsTest1 = test['Med_area'].tolist()
# train
pipe.fit(train1,labelsTrain1)

# test
preds = pipe.predict(test1)
print("accuracy:",accuracy_score(labelsTest1,preds))
print("Top 10 features used to predict: ")
printNMostinformative(vectorizer,clf,10)

pipe = Pipeline([('cleanText',vectorizer)])
transform = pipe.fit_transform(train1,labelsTrain1)
vocab = vectorizer.get_feature_names()

for i in range(len(train1)):
    s = ""
    indexIntoVocab = transform.indices[transform.indptr[i]:transform.indptr[i+1]]
    numOccurences = transform.data[transform.indptr[i]:transform.indptr[i+1]]
    for idx,num in zip(indexIntoVocab,numOccurences):
        s += str((vocab[idx],num))

I keep getting this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-23-4e74698a75fc> in <module>
     33 print("accuracy:",accuracy_score(labelsTest1,preds))
     34 print("Top 10 features used to predict: ")
---> 35 printNMostinformative(vectorizer,clf,10)
     36 
     37 pipe = Pipeline([('cleanText',vectorizer)])

<ipython-input-23-4e74698a75fc> in printNMostinformative(vectorizer,clf,N)
      1 def printNMostinformative(vectorizer,clf,N):
      2     feature_names = vectorizer.get_feature_names()
----> 3     coefs_with_fns = sorted(zip(clf.coef_[0],feature_names))
      4     topClass1 = coefs_with_fns[:N]
      5     topClass2 = coefs_with_fns[:-(N + 1):-1]

**TypeError: zip argument #1 must support iteration**

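I have 2 questions:

How can I fix this error and see the most important words (features) for each class?
If I use predict_proba and feed its output to roc_auc_score, is there a way to see this there as well?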
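
A minimal sketch for both questions, under a few assumptions. On a fitted RandomForestClassifier, `feature_importances_` is a one-dimensional array with one value per vocabulary entry, so `clf.coef_[0]` in the code above would be a single float, which seems consistent with the zip error. Unlike the signed coefficients of a linear model, these importances are not tied to a class, so the sketch ranks words by overall importance rather than splitting them into "Class 1" and "Class 2". The toy texts, the labels and the helper `top_n_features` are hypothetical stand-ins and not part of the original code. The last lines show one way to pass `predict_proba` output to `roc_auc_score`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the real input would be train['Whole_text'] and
# train['Med_area'] from the question.
texts = [
    "implant catheter surgical device",
    "surgical stent catheter patient",
    "patient implant stent device",
    "engine gearbox wheel brake",
    "brake wheel engine piston",
    "piston gearbox engine wheel",
]
labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipe.fit(texts, labels)


def top_n_features(pipeline, n):
    """Return the n (importance, word) pairs with the highest importance."""
    vec = pipeline.named_steps["vectorizer"]
    forest = pipeline.named_steps["clf"]
    # get_feature_names_out exists in scikit-learn >= 1.0; older releases
    # only provide get_feature_names.
    if hasattr(vec, "get_feature_names_out"):
        names = vec.get_feature_names_out()
    else:
        names = vec.get_feature_names()
    importances = forest.feature_importances_  # 1-D: one value per word
    order = np.argsort(importances)[::-1][:n]  # indices, largest first
    return [(float(importances[i]), str(names[i])) for i in order]


top = top_n_features(pipe, 5)
for importance, word in top:
    print(f"{word}: {importance:.3f}")

# Columns of predict_proba follow the order of the classifier's classes_,
# so column 1 is the probability of label 1 here.
proba = pipe.predict_proba(texts)[:, 1]
auc = roc_auc_score(labels, proba)
print("ROC AUC (computed on the training data, for illustration only):", auc)
```

With the pipeline from the question, the same idea would be `top_n_features(pipe, 10)` after `pipe.fit(train1, labelsTrain1)`, and the AUC would be computed from `pipe.predict_proba(test1)` against `labelsTest1`, assuming the labels are binary.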