Get test scores separately for each group after running nested cross-validation with LeavePGroupsOut

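I am using sklearn.model_selection.LeavePGroupsOut to train a classifier on each site in my data set and test it on all the other sites. My problem: after running the analysis, I only get one 'global' test score across all p held-out sites. Instead, I am looking for a way to get a separate test score for each site.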

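Here is an example. I use the breast_cancer data set and create three dummy sites to which the subjects are assigned (note that I give each group a different sample size; see the section below for why):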

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import LeavePGroupsOut
from sklearn.model_selection import cross_validate
from sklearn.datasets import load_breast_cancer

# create a random number generator
rng = np.random.RandomState(42)

# load breast cancer data set
X,y = load_breast_cancer(return_X_y=True)

# for this example, only take the first 300 subjects
X = X[0:300,:]
y = y[0:300]

# define dummy sites, let's assume all subjects came from three different sites
# Let's also assume the three sites have different numbers of subjects
groups = np.concatenate((np.repeat('site_1',150),np.repeat('site_2',100),np.repeat('site_3',50)))

# optimize classifier on one site and leave two sites out for testing
n_groups = 2

# z-standardize features
scaler = StandardScaler()

# use linear L2-regularized Logistic Regression as classifier
lr = LogisticRegression(random_state=rng)

# define parameter grid to optimize over (optimize C)
lr_c = np.linspace(start=0.015625,stop=16,num=11,endpoint=True)
p_grid = {'lr__C':lr_c}

# create pipeline
lr_pipe = Pipeline([
    ('scaler',scaler),('lr',lr)
    ])

# define inner and outer folds (use LeavePGroupsOut)
skf_inner = StratifiedKFold(shuffle=True,random_state=rng)
lpgo_outer = LeavePGroupsOut(n_groups=n_groups)

# implement GridSearch (inner cross validation)
grid = GridSearchCV(lr_pipe,param_grid=p_grid,cv=skf_inner,verbose=1)

# implement cross_validate (outer cross validation)
nested_cv_scores = cross_validate(grid,X,y,groups=groups,cv=lpgo_outer,
                                  return_train_score=True,return_estimator=True,verbose=1)

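Now, nested_cv_scores['test_score'] contains three test scores: 0.915, 0.945, 0.96. Instead, I would like to get six scores (each of the three sites is used for training once, and the other two are used for testing).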
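To see why there is one score per fold rather than one per test site, it can help to print which sites end up in the training and test part of each split. This is a minimal sketch that reuses the site layout from the example; X_dummy is a placeholder array introduced only for illustration, since the splitter needs just the number of samples.

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

# same site layout as in the example: 150 / 100 / 50 subjects
groups = np.concatenate((np.repeat('site_1', 150),
                         np.repeat('site_2', 100),
                         np.repeat('site_3', 50)))

# placeholder features (assumption for illustration): only the length matters here
X_dummy = np.zeros((len(groups), 1))

lpgo = LeavePGroupsOut(n_groups=2)

fold_layout = []
for train_index, test_index in lpgo.split(X_dummy, groups=groups):
    train_sites = np.unique(groups[train_index]).tolist()
    test_sites = np.unique(groups[test_index]).tolist()
    fold_layout.append((train_sites, test_sites, len(test_index)))
    print(train_sites, '->', test_sites, 'n_test =', len(test_index))
```

Each of the three folds trains on a single site and pools the two remaining sites into one test set, so cross_validate reports one pooled score per fold.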

What I have come up with so far:

My idea is to take the pipeline object from each of the three final estimators (nested_cv_scores['estimator'][idx].best_estimator_) and run LeavePGroupsOut again using

for train_index,test_index in lpgo_outer.split(X,y,groups):
    ...

With this, I think I could recompute the test score for each site separately (by calling the predict method and then computing the test score from y_pred and y_true).

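Still, I wonder whether there is a more elegant way to solve this. Maybe I have overlooked an alternative to LeavePGroupsOut? Also note that I cannot use sklearn.model_selection.cross_val_predict here: the three sites have different sample sizes and each site appears in more than one test set, so the test sets do not form a partition (using cross_val_predict instead of cross_validate raises ValueError: cross_val_predict only works for partitions).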
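The "only works for partitions" error can be checked without fitting anything. As far as I understand it, cross_val_predict needs every sample to land in exactly one test set, so counting test-set membership shows the problem. This is a minimal sketch; X_dummy and test_counts are names introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

# same site layout as in the example: 150 / 100 / 50 subjects
groups = np.concatenate((np.repeat('site_1', 150),
                         np.repeat('site_2', 100),
                         np.repeat('site_3', 50)))

# placeholder features (assumption for illustration): only the length matters here
X_dummy = np.zeros((len(groups), 1))

lpgo = LeavePGroupsOut(n_groups=2)

# count how many test sets each sample belongs to
test_counts = np.zeros(len(groups), dtype=int)
for _, test_index in lpgo.split(X_dummy, groups=groups):
    test_counts[test_index] += 1

print(np.unique(test_counts))
```

With three sites and n_groups=2, every sample is held out more than once, so the test sets overlap instead of partitioning the data.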

Solution

The following should solve the problem:

import pandas as pd
from sklearn.metrics import balanced_accuracy_score

site_scores = []

for idx,(train_index,test_index) in enumerate(lpgo_outer.split(X,y,groups)):
    
    # obtain name of the site that was used for training the classifier
    train_site_name = str(np.unique(groups[train_index])[0])
    
    # obtain the final estimator object for this training site
    train_site_estimator = nested_cv_scores['estimator'][idx].best_estimator_
    
    # obtain the train score for this estimator
    train_site_train_score = nested_cv_scores['train_score'][idx]
    
    # get the features and labels for all the other sites
    X_test,y_test = X[test_index],y[test_index]
    
    # obtain predictions
    y_pred = train_site_estimator.predict(X_test)
    
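    # sanity check: make sure that the following score matches the entry
    # in nested_cv_scores['test_score'] (this only holds if cross_validate
    # was run with scoring='balanced_accuracy'; with the default scorer,
    # compare against accuracy instead)
    sanity_check_bac = balanced_accuracy_score(y_true=y_test,y_pred=y_pred)
    
    # compare with a tolerance rather than exact float equality
    if not np.isclose(sanity_check_bac,nested_cv_scores['test_score'][idx]):
        raise ValueError('Manually calculated test score does not match test score in nested_cv_scores')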
    
    # get an array for the test sites
    test_sites = groups[test_index]
    
    # create a dataframe from y_true,y_pred and names of test sites
    test_sites_df = pd.DataFrame({'y_true':y_test,'y_pred':y_pred,'site':test_sites})
    
    # calculate BAC separately for each site
    for name,group in test_sites_df.groupby('site'):
        
        bac = balanced_accuracy_score(group['y_true'],group['y_pred'])
        site_scores.append((train_site_name,train_site_train_score,name,bac))

df = pd.DataFrame(site_scores,columns=['train_site','train_site_score','test_site','test_site_score'])
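The per-site scoring step can also be pulled out into a small reusable function. The sketch below is one possible way to do that and is not part of the original answer: per_group_scores is a hypothetical helper name, and the toy data from make_classification only stands in for the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score


def per_group_scores(estimator, X, y, groups, test_index):
    """Return {group name: balanced accuracy} for the samples in test_index."""
    y_pred = estimator.predict(X[test_index])
    frame = pd.DataFrame({'y_true': y[test_index],
                          'y_pred': y_pred,
                          'group': groups[test_index]})
    return {name: balanced_accuracy_score(part['y_true'], part['y_pred'])
            for name, part in frame.groupby('group')}


# toy data standing in for the real data set (hypothetical, illustration only)
X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=0)
groups_toy = np.repeat(['site_a', 'site_b', 'site_c'], 40)

# train on one site, then score each of the remaining sites separately
train_mask = groups_toy == 'site_a'
clf = LogisticRegression().fit(X_toy[train_mask], y_toy[train_mask])
scores = per_group_scores(clf, X_toy, y_toy, groups_toy, np.where(~train_mask)[0])
print(scores)
```

In the nested setup above, the same helper could be called once per outer fold with nested_cv_scores['estimator'][idx].best_estimator_ and the fold's test_index.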
