
Undersampling does not improve precision in binary classification

How to fix undersampling not improving precision in binary classification

My code snippet:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     GridSearchCV, StratifiedKFold)
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=20000, n_features=8, n_informative=6,
                           n_classes=2, weights=[150/151, 1/151],
                           n_redundant=2, n_clusters_per_class=3,
                           class_sep=1.5, random_state=1729)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
LogisticRegression().fit(X_train, y_train)
# cross_val_score needs the labels as well as the features
training_score = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)

log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# the 'l1' penalty requires a solver that supports it, e.g. liblinear
grid_log_reg = GridSearchCV(LogisticRegression(solver='liblinear'), log_reg_params)
grid_log_reg.fit(X_train, y_train)

log_reg = grid_log_reg.best_estimator_

log_reg_score = cross_val_score(log_reg, X_train, y_train, cv=5)

sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)

for train_index, test_index in sss.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    Xtrain, Xtest = X[train_index], X[test_index]
    ytrain, ytest = y[train_index], y[test_index]

for train, test in sss.split(Xtrain, ytrain):
    # undersampling happens inside cross-validation, not before it
    pipeline = imbalanced_make_pipeline(
        RandomUnderSampler(sampling_strategy='majority', random_state=42), log_reg)
    model = pipeline.fit(Xtrain[train], ytrain[train])
    prediction = model.predict(Xtrain[test])

I used scikit-learn's make_classification method to create datasets with various levels of class imbalance, and then applied resampling techniques to see how effective they are. From the research I have done, applying undersampling should always improve precision at the cost of recall, but that is not happening in my case: the performance with undersampling is almost identical to the performance without any resampling. I would like to know whether I have made a mistake in the code, or why the undersampling behaves this way.
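To check whether undersampling changes anything at all, it can help to compare the two models side by side on the same held-out test set. Below is a minimal sketch of such a comparison; to keep it dependent only on scikit-learn and NumPy, it undersamples the majority class manually (as a stand-in for imblearn's RandomUnderSampler) rather than through a pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Same highly imbalanced dataset as in the question (~1 positive per 151 samples)
X, y = make_classification(n_samples=20000, n_features=8, n_informative=6,
                           n_classes=2, weights=[150/151, 1/151],
                           n_redundant=2, n_clusters_per_class=3,
                           class_sep=1.5, random_state=1729)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Baseline: logistic regression trained on the imbalanced data as-is
base = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Manual random undersampling: keep all minority samples and an equal-sized
# random subset of the majority class (what RandomUnderSampler with
# sampling_strategy='majority' does)
rng = np.random.default_rng(42)
maj_idx = np.where(y_train == 0)[0]
min_idx = np.where(y_train == 1)[0]
keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
keep = np.concatenate([keep_maj, min_idx])
under = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])

# Compare precision and recall for the minority class on the same test set
for name, model in [("baseline", base), ("undersampled", under)]:
    pred = model.predict(X_test)
    print(name,
          "precision=%.3f" % precision_score(y_test, pred, zero_division=0),
          "recall=%.3f" % recall_score(y_test, pred, zero_division=0))
```

If the two printed lines are nearly identical, the undersampling step itself is not the problem; with class_sep=1.5 the classes may already be separable enough that moving the decision threshold via resampling barely changes the predictions.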

Thanks for your help!
