How to fix undersampling failing to improve precision in binary classification
My code snippet:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=20000, n_features=8, n_informative=6, n_classes=2,
                           weights=[150/151, 1/151], n_redundant=2,
                           n_clusters_per_class=3, class_sep=1.5, random_state=1729)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

LogisticRegression().fit(X_train, y_train)
training_score = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)  # y_train was missing

# the 'l1' penalty needs a solver that supports it, e.g. liblinear
log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_log_reg = GridSearchCV(LogisticRegression(solver='liblinear'), log_reg_params)
grid_log_reg.fit(X_train, y_train)
log_reg = grid_log_reg.best_estimator_
log_reg_score = cross_val_score(log_reg, X_train, y_train, cv=5)  # X and y were missing

skf = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X_train, y_train):
    print("Train:", train_index, "Test:", test_index)
    # undersampling happens inside the pipeline, during cross-validation, not before
    pipeline = imbalanced_make_pipeline(
        RandomUnderSampler(sampling_strategy='majority', random_state=42), log_reg)
    model = pipeline.fit(X_train[train_index], y_train[train_index])
    prediction = model.predict(X_train[test_index])
I have used scikit-learn's make_classification method to create datasets with various levels of imbalance, and I then apply resampling techniques to see how effective they are. From the research I have done, applying undersampling should improve precision at the cost of recall, but that is not happening in my case: undersampling performs very similarly to not resampling at all. I would like to know whether I have made a mistake somewhere in my code, or what else could explain how undersampling is behaving.

Thanks for your help!