How to obtain a separate test score for each group after running nested cross-validation with LeavePGroupsOut
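I am using sklearn.model_selection.LeavePGroupsOut to train a classifier on each site in my dataset and test it on all other sites. My problem: after running the analysis, I only get a 'global' test score computed over all p sites used for testing. Instead, I am looking for a way to obtain a test score for each site separately.
Here is an example in which I use the breast_cancer dataset and create three dummy sites to which the subjects are assigned (note that I created a different sample size for each group; see the section below on why I did this):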
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import LeavePGroupsOut
from sklearn.model_selection import cross_validate
from sklearn.datasets import load_breast_cancer

# create a random number generator
rng = np.random.RandomState(42)

# load breast cancer data set
X, y = load_breast_cancer(return_X_y=True)

# for this example, only take the first 300 subjects
X = X[0:300, :]
y = y[0:300]

# define dummy sites: let's assume all subjects came from three different sites
# and that the three sites have different numbers of subjects
groups = np.concatenate((np.repeat('site_1', 150),
                         np.repeat('site_2', 100),
                         np.repeat('site_3', 50)))

# optimize classifier on one site and leave two sites out for testing
n_groups = 2

# z-standardize features
scaler = StandardScaler()

# use linear L2-regularized Logistic Regression as classifier
lr = LogisticRegression(random_state=rng)

# define parameter grid to optimize over (optimize C)
lr_c = np.linspace(start=0.015625, stop=16, num=11, endpoint=True)
p_grid = {'lr__C': lr_c}

# create pipeline
lr_pipe = Pipeline([
    ('scaler', scaler),
    ('lr', lr)
])

# define inner and outer folds (use LeavePGroupsOut for the outer folds)
skf_inner = StratifiedKFold(shuffle=True, random_state=rng)
lpgo_outer = LeavePGroupsOut(n_groups=n_groups)

# implement GridSearch (inner cross validation)
grid = GridSearchCV(lr_pipe, param_grid=p_grid, cv=skf_inner, verbose=1)

# implement cross_validate (outer cross validation)
nested_cv_scores = cross_validate(grid, X, y, groups=groups, cv=lpgo_outer,
                                  return_train_score=True,
                                  return_estimator=True, verbose=1)
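Now, when one looks at nested_cv_scores['test_score'], one gets the following three test scores: 0.915, 0.945, 0.96. Instead, I would like to obtain six scores (each of the three sites is used for training once, and each time the other two sites are used for testing).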
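To make the gap between the three scores and the six desired scores concrete, the sketch below lists what each outer split contains. It reuses only the site layout from the question; X_dummy and split_summary are names introduced here for illustration, and the placeholder features stand in for the real data because only the group labels matter for splitting.

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

# Same dummy site layout as in the question: 150, 100 and 50 subjects
groups = np.concatenate((np.repeat('site_1', 150),
                         np.repeat('site_2', 100),
                         np.repeat('site_3', 50)))

# Placeholder features (illustrative): the splitter only uses their length
X_dummy = np.zeros((len(groups), 1))

lpgo = LeavePGroupsOut(n_groups=2)

split_summary = []
for train_index, test_index in lpgo.split(X_dummy, groups=groups):
    train_sites = np.unique(groups[train_index]).tolist()
    test_sites = np.unique(groups[test_index]).tolist()
    split_summary.append((train_sites, test_sites, len(test_index)))
    print(f"train on {train_sites}, test on {test_sites} "
          f"({len(test_index)} samples)")
```

Each split trains on a single site and pools the two held-out sites into one test set, which is why cross_validate reports one score per split rather than one per test site.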
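What I have already come up with:
I thought of taking the pipeline object from each of the three final estimators (nested_cv_scores['estimator'][idx].best_estimator_) and iterating over the LeavePGroupsOut splits:
for train_index, test_index in lpgo_outer.split(X, y, groups):
    ...
With this, I think it should be possible to recompute the test score for each site separately (by calling the predict method and then computing the test score from y_pred and y_true).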
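Still, I wonder whether there is a more elegant way to solve this. Maybe I have overlooked an alternative to LeavePGroupsOut? Also note that I cannot use sklearn.model_selection.cross_val_predict here, because the three sites have different sample sizes (when using cross_val_predict instead of cross_validate, one gets ValueError: cross_val_predict only works for partitions).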
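A side note on the ValueError quoted above: it is probably not caused by the unequal sample sizes. cross_val_predict requires every sample to appear in exactly one test set, and with n_groups=2 each site is held out in two of the three splits. The sketch below checks this with three equally sized dummy sites; groups_equal, X_dummy and the other names are illustrative and not part of the question's code.

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

# Three equally sized dummy sites, so unequal sample sizes are ruled out
groups_equal = np.repeat(['site_1', 'site_2', 'site_3'], 100)
X_dummy = np.zeros((len(groups_equal), 1))  # placeholder features

lpgo = LeavePGroupsOut(n_groups=2)

# Collect the test indices of all splits
test_indices = np.concatenate(
    [test_index for _, test_index in lpgo.split(X_dummy, groups=groups_equal)]
)

# A partition would contain every sample index exactly once
counts = np.bincount(test_indices, minlength=len(groups_equal))
is_partition = bool(np.all(counts == 1))

print("test sets form a partition:", is_partition)
print("times each sample is tested:", np.unique(counts).tolist())
```

Because every sample lands in two test sets, the splits overlap and do not form a partition, so the same error would be raised even with equally sized sites.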
Solution
The following should solve the problem:
import pandas as pd
from sklearn.metrics import balanced_accuracy_score

site_scores = []

for idx, (train_index, test_index) in enumerate(lpgo_outer.split(X, y, groups)):
    # obtain name of the site that was used for training the classifier
    train_site_name = str(np.unique(groups[train_index])[0])
    # obtain the final estimator object for this training site
    train_site_estimator = nested_cv_scores['estimator'][idx].best_estimator_
    # obtain the train score for this estimator
    train_site_train_score = nested_cv_scores['train_score'][idx]
    # get the features and labels for all the other sites
    X_test, y_test = X[test_index], y[test_index]
    # obtain predictions
    y_pred = train_site_estimator.predict(X_test)
    # sanity check: make sure that the following score matches the
    # corresponding entry in nested_cv_scores['test_score'].
    # This assumes cross_validate was run with scoring='balanced_accuracy';
    # with the default scorer, compare against accuracy_score instead.
    sanity_check_bac = balanced_accuracy_score(y_true=y_test, y_pred=y_pred)
    if not np.isclose(sanity_check_bac, nested_cv_scores['test_score'][idx]):
        raise ValueError('Manually calculated test score does not match test score in nested_cv_scores')
    # get an array with the site name of each test sample
    test_sites = groups[test_index]
    # create a dataframe from y_true, y_pred and names of test sites
    test_sites_df = pd.DataFrame({'y_true': y_test, 'y_pred': y_pred, 'site': test_sites})
    # calculate BAC separately for each site
    for name, group in test_sites_df.groupby('site'):
        bac = balanced_accuracy_score(group['y_true'], group['y_pred'])
        site_scores.append((train_site_name, train_site_train_score, name, bac))

df = pd.DataFrame(site_scores, columns=['train_site', 'train_site_score', 'test_site', 'test_site_score'])
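If the estimators returned by cross_validate are not needed, one possible alternative is to skip LeavePGroupsOut and loop over the sites directly, fitting on one site and scoring on each of the others. The sketch below is a minimal illustration under several assumptions: it uses synthetic data from make_classification instead of the breast cancer data, a plain pipeline without the inner grid search, and illustrative names (X_demo, y_demo, groups_demo, pairwise_scores) that do not appear in the code above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the question's data set)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)
groups_demo = np.concatenate((np.repeat('site_1', 150),
                              np.repeat('site_2', 100),
                              np.repeat('site_3', 50)))

rows = []
for train_site in np.unique(groups_demo):
    train_mask = groups_demo == train_site
    # Plain pipeline without hyperparameter search, to keep the sketch short
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X_demo[train_mask], y_demo[train_mask])
    for test_site in np.unique(groups_demo):
        if test_site == train_site:
            continue  # only score on the sites that were not used for training
        test_mask = groups_demo == test_site
        y_pred = model.predict(X_demo[test_mask])
        bac = balanced_accuracy_score(y_demo[test_mask], y_pred)
        rows.append((str(train_site), str(test_site), bac))

pairwise_scores = pd.DataFrame(
    rows, columns=['train_site', 'test_site', 'test_site_score'])
print(pairwise_scores)
```

This yields one row per (train site, test site) pair, six in total for three sites. To keep the nested design, the plain pipeline could be replaced by a GridSearchCV object like the one in the question.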