ValueError: Number of coefficients doesn't match number of feature names (mglearn visualization)

How to resolve ValueError: Number of coefficients doesn't match number of feature names in the mglearn visualization

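I am trying to run sentiment analysis on product reviews collected from various websites. I was able to follow the article below up to the step where it visualizes the model coefficients.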

https://towardsdatascience.com/how-a-simple-algorithm-classifies-texts-with-moderate-accuracy-79f0cd9eb47

When I run my program, I get the following error:

ValueError: Number of coefficients 6021 doesn't match number of feature names 6290.

Any suggestions on how to make sure the number of coefficients matches the number of features in my dataset?

Here is my code:

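import numpy as np
import mglearn

# reviews holds the collected review data, loaded earlier (not shown)
y = reviews['Review Type']
X = reviews['Review Comment']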

#Split the data into training and test sets
from sklearn.model_selection import train_test_split
text_train,text_test,y_train,y_test = train_test_split(X,y,random_state=0)

#run the feature extraction on training & test independent variables with bag of words
#changing the variable back to X_train after transforming it.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
print(repr(X_train))

X_test = vect.transform(text_test)
print(repr(X_test))

feature_names = vect.get_feature_names()
print(len(feature_names))

#running a logistic regression model to predict whether a review is positive 
#or negative

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=10000,class_weight='balanced',random_state=0)
param_grid = {'C': [0.01,0.1,1,10,100]}

grid = GridSearchCV(logreg,param_grid,scoring= 'roc_auc',cv=5)
logreg_train = grid.fit(X_train,y_train)

pred_logreg = logreg_train.predict(X_test)
confusion = confusion_matrix(y_test,pred_logreg)
print(confusion)
print("Classification accuracy is: ",(confusion[0][0] + confusion[1][1]) / np.sum(confusion))

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
import seaborn as sns; sns.set();

fpr,tpr,thresholds = roc_curve(y_test,grid.decision_function(X_test))
# find threshold closest to zero:
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero],tpr[close_zero],'o',markersize=10,label= 'threshold zero(default)',fillstyle= 'none',c='k',mew=2)
plt.plot([0,1],[0,1],linestyle='-',lw=2,color='r',label='random',alpha=0.8)
plt.legend(loc=4)
plt.plot(fpr,tpr,label='ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (recall)')
plt.title('roc_curve');
from sklearn.metrics import auc
print('AUC score is: ',auc(fpr,tpr));


from sklearn.metrics import precision_recall_curve
precision,recall,thresholds = precision_recall_curve(y_test,logreg_train.decision_function(X_test))
close_zero = np.argmin(np.abs(thresholds))
plt.plot(precision[close_zero],recall[close_zero],'o',markersize=10,label="threshold zero",fillstyle="none",c="k",mew=2)
plt.plot(precision,recall,label="precision recall curve")
plt.xlabel("precision")
plt.ylabel("recall")
plt.title("Precision Recall Curve")
plt.legend(loc="best");

from sklearn.feature_extraction.text import TfidfVectorizer
logreg = LogisticRegression(max_iter=10000,class_weight="balanced",random_state=0)
pipe = make_pipeline(TfidfVectorizer(norm=None,stop_words='english'),logreg)
param_grid = {'logisticregression__C': [0.001,0.01,10]}
grid = GridSearchCV(pipe,param_grid,scoring="roc_auc",cv=5)
logreg_train = grid.fit(text_train,y_train)

fpr,tpr,thresholds = roc_curve(y_test,grid.decision_function(text_test))
pred_logreg = logreg_train.predict(text_test)
confusion = confusion_matrix(y_test,pred_logreg)
print(confusion)
print("Classification accuracy is: ",(confusion[0][0] + confusion[1][1]) / np.sum(confusion))
print("Test AUC score is: ",auc(fpr,tpr));

mglearn.tools.visualize_coefficients(grid.best_estimator_.named_steps["logisticregression"].coef_,feature_names,n_top_features=25)

Solution

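You built feature_names from the CountVectorizer, which defaults to stop_words=None, but the model in your last block of code is fitted on a TfidfVectorizer with stop_words='english'. The two vectorizers produce vocabularies of different sizes: 6290 features with stop words kept, and 6021 once they are removed, which is the number of coefficients the model has. Take the feature names from the vectorizer inside the fitted pipeline instead: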

feature_names = grid.best_estimator_.named_steps["tfidfvectorizer"].get_feature_names()