How to resolve a ValueError when using random oversampling with StratifiedKFold


            amt    gender   city_pop    birth_year  distance        
153118  -0.786537   0.0    -0.318571    0.913779    -0.400876   
153226  -0.488455   0.0    -0.322397    0.741579     1.384297   
153228  0.437970    0.0    -0.329167    1.774776    -0.658839   
153303  -0.877627   0.0    -0.329656    1.258177    -1.100713   
153313  0.462143    1.0    -0.313817    1.372977     0.038791   

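I am currently trying to build some models (such as logistic regression, decision tree, and random forest) on this data using RandomOverSampler and StratifiedKFold cross-validation. I am doing this because the minority class of my target variable makes up only 0.3% of the data.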


import pandas as pd
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

skf = StratifiedKFold(n_splits=5, random_state=None)

for fold, (train_index, test_index) in enumerate(skf.split(X, y), 1):
    X_train = X.reindex(index=train_index)
    y_train = y.reindex(index=train_index)
    X_test = X.reindex(index=test_index)
    y_test = y.reindex(index=test_index)
    ROS = RandomOverSampler(sampling_strategy=0.5)
    X_over, y_over = ROS.fit_resample(X_train, y_train)
# Create DataFrame for X_over
X_over = pd.DataFrame(data=X_over, columns=X_train.columns)


ValueError                                Traceback (most recent call last)
<ipython-input-90-372645e869d1> in <module>
      4 oversample = RandomOverSampler(sampling_strategy=1)
      5 # fit and apply the transform
----> 6 X_over,y_over = oversample.fit_resample(X_train,y_train)

~\anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self,X,y)
     73             The corresponding label of `X_resampled`.
     74         """
---> 75         check_classification_targets(y)
     76         arrays_transformer = ArraysTransformer(X,y)
     77         X,y,binarize_y = self._check_X_y(X,y)

~\anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
    178     y : array-like
    179     """
--> 180     y_type = type_of_target(y)
    181     if y_type not in ['binary','multiclass','multiclass-multioutput',
    182                       'multilabel-indicator','multilabel-sequences']:

~\anaconda3\lib\site-packages\sklearn\utils\multiclass.py in type_of_target(y)
    301     if y.dtype.kind == 'f' and np.any(y != y.astype(int)):
    302         # [.1,.2,3] or [[.1,3]] or [[1.,.2]] and not [1.,2.,3.]
--> 303         _assert_all_finite(y)
    304         return 'continuous' + suffix

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X,allow_nan,msg_dtype)
    104                     msg_err.format
    105                     (type_err,
--> 106                      msg_dtype if msg_dtype is not None else X.dtype)
    107             )
    108     # for object dtype data,we only check for NaNs (GH-13254)

ValueError: Input contains NaN,infinity or a value too large for dtype('float64').


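It would be easier to answer after seeing the data. That said, I would suggest doing the oversampling before the cross-validation step. Please give that a try.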
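
One more thing that may be worth checking, offered as a guess from the sample rows rather than a confirmed diagnosis: the index labels shown in the question start around 153118, while `skf.split` returns integer positions (0, 1, 2, ...). If `X` and `y` are pandas objects with such a non-default index, `reindex(index=train_index)` looks those positions up as labels, finds none of them, and fills the rows with NaN, which would produce the NaN error in `fit_resample`. The sketch below uses made-up toy data (the values and the `is_fraud` column name are hypothetical) to compare `reindex` with positional selection via `iloc`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): index labels start at 153118 instead of 0,
# mimicking the sample rows shown in the question.
rng = np.random.default_rng(0)
labels = np.arange(153118, 153118 + 20)
X = pd.DataFrame(
    {"amt": rng.normal(size=20), "distance": rng.normal(size=20)},
    index=labels,
)
y = pd.Series([0] * 15 + [1] * 5, index=labels, name="is_fraud")

skf = StratifiedKFold(n_splits=5)
nan_with_reindex = 0
nan_with_iloc = 0
for train_index, test_index in skf.split(X, y):
    # split() yields integer positions; reindex() treats them as labels,
    # so labels that are absent from the index become NaN entries.
    nan_with_reindex += int(y.reindex(index=train_index).isna().sum())
    # iloc selects by position, which matches what split() returns.
    nan_with_iloc += int(y.iloc[train_index].isna().sum())

print("NaN targets via reindex:", nan_with_reindex)
print("NaN targets via iloc:   ", nan_with_iloc)
```

If this matches your situation, replacing `X.reindex(index=train_index)` with `X.iloc[train_index]` (and likewise for `y` and the test split) inside the loop, or calling `reset_index(drop=True)` on `X` and `y` beforehand, should leave the training fold free of NaN before it is passed to `RandomOverSampler`.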

