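How to fix ValueError: number of coefficients doesn't match number of feature names in mglearn visualization
I am trying to run sentiment analysis on product reviews collected from various websites. I was able to follow the article I am working from up to the step where the model coefficients are visualized.
When I run my program, I get the following error:
ValueError: Number of coefficients 6021 doesn't match number offeature names 6290.
Any suggestions on how to make sure the number of coefficients matches the number of features in my dataset?
Here is my code: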
import numpy as np
y = reviews['Review Type']
X = reviews['Review Comment']
#Split the data into training and test sets
from sklearn.model_selection import train_test_split
text_train,text_test,y_train,y_test = train_test_split(X,y,random_state=0)
#run the feature extraction on training & test independent variables with bag of words
#changing the variable back to X_train after transforming it.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
print(repr(X_train))
X_test = vect.transform(text_test)
print(repr(X_test))
feature_names = vect.get_feature_names()
print(len(feature_names))
#running a logistic regression model to predict whether a review is positive
#or negative
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter=10000,class_weight='balanced',random_state=0)
param_grid = {'C': [0.01,0.1,1,10,100]}
grid = GridSearchCV(logreg,param_grid,scoring= 'roc_auc',cv=5)
logreg_train = grid.fit(X_train,y_train)
pred_logreg = logreg_train.predict(X_test)
confusion = confusion_matrix(y_test,pred_logreg)
print(confusion)
print("Classification accuracy is: ",(confusion[0][0] + confusion[1][1]) / np.sum(confusion))
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
import seaborn as sns; sns.set();
fpr,tpr,thresholds = roc_curve(y_test,grid.decision_function(X_test))
# find threshold closest to zero:
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero],tpr[close_zero],'o',markersize=10,label= 'threshold zero(default)',fillstyle= 'none',c='k',mew=2)
plt.plot([0,1],[0,1],linestyle='-',lw=2,color='r',label='random',alpha=0.8)
plt.plot(fpr,tpr,label='ROC Curve')
plt.legend(loc=4)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (recall)')
plt.title('roc_curve');
from sklearn.metrics import auc
print('AUC score is: ',auc(fpr,tpr));
from sklearn.metrics import precision_recall_curve
precision,recall,thresholds = precision_recall_curve(
    y_test,logreg_train.decision_function(X_test))
close_zero = np.argmin(np.abs(thresholds))
plt.plot(precision[close_zero],recall[close_zero],'o',markersize=10,label="threshold zero",fillstyle="none",c="k",mew=2)
plt.plot(precision,recall,label="precision recall curve")
plt.xlabel("precision")
plt.ylabel("recall")
plt.title("Precision Recall Curve")
plt.legend(loc="best");
from sklearn.feature_extraction.text import TfidfVectorizer
logreg = LogisticRegression(max_iter=10000,class_weight="balanced",random_state=0)
pipe = make_pipeline(TfidfVectorizer(norm=None,stop_words='english'),logreg)
param_grid = {'logisticregression__C': [0.001,0.01,10]}
grid = GridSearchCV(pipe,param_grid,scoring="roc_auc",cv=5)
logreg_train = grid.fit(text_train,y_train)
fpr,tpr,thresholds = roc_curve(y_test,grid.decision_function(text_test))
pred_logreg = logreg_train.predict(text_test)
confusion = confusion_matrix(y_test,pred_logreg)
print(confusion)
print("Classification accuracy is: ",(confusion[0][0] + confusion[1][1]) / np.sum(confusion))
print("Test AUC score is: ",auc(fpr,tpr));
import mglearn
# this call raises the ValueError shown above
mglearn.tools.visualize_coefficients(grid.best_estimator_.named_steps["logisticregression"].coef_,feature_names,n_top_features=25)
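Solution
You built feature_names from a CountVectorizer, which defaults to stop_words=None, but the model in your last block of code uses a TfidfVectorizer with stop_words='english'. The two vectorizers therefore produce vocabularies of different sizes: 6290 names from the CountVectorizer against 6021 coefficients learned on the TfidfVectorizer features. Use this instead: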
feature_names = grid.best_estimator_.named_steps["tfidfvectorizer"].get_feature_names()
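To see the mismatch on a small scale, here is a minimal sketch. The toy corpus, the labels, and the helper name vocabulary are made up for illustration and are not part of the original question. It fits a standalone CountVectorizer and a pipeline containing a TfidfVectorizer with stop_words='english', then compares the vocabulary sizes with the width of coef_. The helper also falls back between get_feature_names_out() and get_feature_names(), since the latter was removed in scikit-learn 1.2.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def vocabulary(vectorizer):
    """Return the fitted vectorizer's feature names as a list of strings."""
    # Hypothetical helper: newer scikit-learn releases only provide
    # get_feature_names_out(); older ones only provide get_feature_names().
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


# Toy data standing in for reviews['Review Comment'] / reviews['Review Type'].
texts = [
    "this product is great and I love it",
    "the quality is very good",
    "this was a terrible purchase",
    "I do not like the product at all",
]
labels = [1, 1, 0, 0]

# Standalone CountVectorizer with the default stop_words=None, as in the question.
count_names = vocabulary(CountVectorizer().fit(texts))

# Pipeline whose own TfidfVectorizer drops English stop words.
pipe = make_pipeline(
    TfidfVectorizer(norm=None, stop_words="english"),
    LogisticRegression(max_iter=10000),
)
pipe.fit(texts, labels)

coef = pipe.named_steps["logisticregression"].coef_
tfidf_names = vocabulary(pipe.named_steps["tfidfvectorizer"])

# The coefficient count matches the pipeline's vocabulary, not the standalone one.
print(len(count_names), len(tfidf_names), coef.shape[1])
```

The same idea applies to the grid search in the question: take the names from grid.best_estimator_.named_steps["tfidfvectorizer"], so that they come from the vectorizer the coefficients were actually trained on.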