
Does the imblearn pipeline turn off sampling for testing?

Let's assume the following code (taken from the imblearn example on pipelines):

...
# Instantiate a PCA object for the sake of easy visualisation
pca = PCA(n_components=2)

# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

# Create the classifier
knn = KNN(1)

# Make the splits
X_train, X_test, y_train, y_test = tts(X, y, random_state=42)

# Add one transformer and two samplers in the pipeline object
pipeline = make_pipeline(pca, enn, renn, knn)

pipeline.fit(X_train, y_train)
y_hat = pipeline.predict(X_test)

I want to make sure that the samplers enn and renn are not executed when pipeline.predict(X_test) is run (but pca, of course, must be).

  1. Firstly, it is clear to me that over-, under-, and mixed-sampling are procedures meant to be applied to the training set, not to the test/validation set. Please correct me here if I am wrong.

  2. I went through the imblearn Pipeline code, but I could not find a predict method there.

  3. I also want to be sure that this correct behaviour holds when the pipeline is inside a GridSearchCV.

I just need some assurance that this is what happens with imblearn.Pipeline.

EDIT: 2020-08-28

@wundermahn's answer is exactly what I needed.

This edit is just to add that this is the reason why imblearn.Pipeline should be used for imbalanced pre-processing instead of sklearn.Pipeline; nowhere in the imblearn documentation did I find an explanation of why imblearn.Pipeline is needed when sklearn.Pipeline already exists.
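
To make the difference concrete, here is a minimal sketch (my addition, not part of the original post) contrasting the two pipeline classes. It assumes recent versions of scikit-learn and imbalanced-learn and uses SMOTE purely as an example sampler; on those versions sklearn.Pipeline is expected to reject the sampler at fit time, because samplers implement fit_resample rather than transform, while imblearn.Pipeline accepts it and resamples only during fit:

# Sketch: samplers are not valid intermediate steps for sklearn.Pipeline,
# but are for imblearn.Pipeline (assumption: recent library versions).
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline as SkPipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

try:
    # sklearn.Pipeline validates its steps when fit is called and should
    # raise because SMOTE has no transform method.
    SkPipeline([('smote', SMOTE()), ('knn', KNeighborsClassifier())]).fit(X, y)
except TypeError as exc:
    print('sklearn.Pipeline rejected the sampler:', exc)

# imblearn.Pipeline knows about samplers: it calls fit_resample during fit
# and skips the sampler at predict time.
imb_pipe = ImbPipeline([('smote', SMOTE()), ('knn', KNeighborsClassifier())])
imb_pipe.fit(X, y)
print(imb_pipe.predict(X).shape)  # one prediction per input row, no resampling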

Solution

Great questions. Taking them in the order you posted them:

  1. Firstly, it is clear to me that over-sampling, under-sampling and mixed-sampling are procedures to be applied to the training set, not to the test/validation set. Please correct me here if I am wrong.

That is correct. You certainly do not want to test (whether on your test or validation data) on data that is not representative of the actual, live, "production" dataset. You should really only apply this to your training data. Please note that if you are using a technique such as cross-fold validation, you should apply the sampling to each fold individually, as indicated in this IEEE paper.
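
If it helps, here is a minimal sketch of that per-fold pattern (my addition, not part of the original answer). It assumes SMOTE and KNeighborsClassifier as example components and a synthetic imbalanced dataset; because the sampler lives inside an imblearn Pipeline handed to cross_val_score, it is re-fitted on each training fold and never touches the corresponding held-out fold:

# Per-fold resampling sketch: cross_val_score splits the original data,
# and the pipeline applies SMOTE only within each training fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

fold_safe_pipe = make_pipeline(SMOTE(random_state=0), KNeighborsClassifier())
scores = cross_val_score(fold_safe_pipe, X, y, cv=5, scoring='f1')
print(scores.mean())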

  2. I went through imblearn's Pipeline code, but I could not find a predict method there.

I am assuming you found the imblearn.pipeline source code, and if you did, what you want to do is take a look at the fit_predict method:

    @if_delegate_has_method(delegate="_final_estimator")
    def fit_predict(self, X, y=None, **fit_params):
        """Apply `fit_predict` of last step in pipeline after transforms.
        Applies fit_transforms of a pipeline to the data, followed by the
        fit_predict method of the final estimator in the pipeline. Valid
        only if the final estimator implements fit_predict.
        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first step of
            the pipeline.
        y : iterable, default=None
            Training targets. Must fulfill label requirements for all steps
            of the pipeline.
        **fit_params : dict of string -> object
            Parameters passed to the ``fit`` method of each step, where
            each parameter name is prefixed such that parameter ``p`` for step
            ``s`` has key ``s__p``.
        Returns
        -------
        y_pred : ndarray of shape (n_samples,)
            The predicted target.
        """
        Xt, yt, fit_params = self._fit(X, y, **fit_params)
        with _print_elapsed_time('Pipeline',
                                 self._log_message(len(self.steps) - 1)):
            y_pred = self.steps[-1][-1].fit_predict(Xt, yt, **fit_params)
        return y_pred

Here we can see that the pipeline uses the .predict method of the final estimator in the pipeline; in the example you posted, that is scikit-learn's knn:

    def predict(self, X):
        """Predict the class labels for the provided data.
        Parameters
        ----------
        X : array-like of shape (n_queries, n_features), \
                or (n_queries, n_indexed) if metric == 'precomputed'
            Test samples.
        Returns
        -------
        y : ndarray of shape (n_queries,) or (n_queries, n_outputs)
            Class labels for each data sample.
        """
        X = check_array(X, accept_sparse='csr')

        neigh_dist, neigh_ind = self.kneighbors(X)
        classes_ = self.classes_
        _y = self._y
        if not self.outputs_2d_:
            _y = self._y.reshape((-1, 1))
            classes_ = [self.classes_]

        n_outputs = len(classes_)
        n_queries = _num_samples(X)
        weights = _get_weights(neigh_dist, self.weights)

        y_pred = np.empty((n_queries, n_outputs), dtype=classes_[0].dtype)
        for k, classes_k in enumerate(classes_):
            if weights is None:
                mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
            else:
                mode, _ = weighted_mode(_y[neigh_ind, k], weights, axis=1)

            mode = np.asarray(mode.ravel(), dtype=np.intp)
            y_pred[:, k] = classes_k.take(mode)

        if not self.outputs_2d_:
            y_pred = y_pred.ravel()

        return y_pred
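
As a quick sanity check (my addition, not part of the original answer), you can also verify this empirically. The sketch below subclasses SMOTE into a hypothetical CountingSMOTE helper that counts how often fit_resample is called; since the pipeline clones its steps before fitting, the counter is read back from pipe.named_steps. Fitting triggers exactly one resampling call, while predicting triggers none and returns one prediction per test row:

# Empirical check (sketch): the sampler resamples during fit but is skipped
# entirely during predict.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

class CountingSMOTE(SMOTE):
    """Hypothetical helper: a SMOTE subclass that counts fit_resample calls."""
    def fit_resample(self, X, y):
        self.n_resample_calls_ = getattr(self, 'n_resample_calls_', 0) + 1
        return super().fit_resample(X, y)

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([('sampling', CountingSMOTE(random_state=0)),
                 ('classification', KNeighborsClassifier(n_neighbors=1))])
pipe.fit(X_train, y_train)
calls_after_fit = pipe.named_steps['sampling'].n_resample_calls_  # 1

y_hat = pipe.predict(X_test)
print(pipe.named_steps['sampling'].n_resample_calls_ == calls_after_fit)  # True
print(len(y_hat) == len(X_test))                                          # True
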
  3. I also want to make sure this correct behaviour works when the pipeline is inside a GridSearchCV

This assumes both of the assumptions above are correct, and I take it to mean you would like a complete, minimal, reproducible example that works in a GridSearchCV. There is extensive documentation from scikit-learn on this, but an example I created using knn is below:

import pandas as pd, numpy as np

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split

param_grid = [
    {
        'classification__n_neighbors': [1, 3, 5, 7, 10],
    }
]

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.20)

pipe = Pipeline([
    ('sampling', SMOTE()),
    ('classification', KNeighborsClassifier())
])

grid = GridSearchCV(pipe, param_grid=param_grid)
grid.fit(X_train, y_train)
mean_scores = np.array(grid.cv_results_['mean_test_score'])
print(mean_scores)

# [0.98051926 0.98121129 0.97981998 0.98050474 0.97494193]

Your intuition was spot on, good job :)
