Random oversampling with StratifiedKFold - ValueError

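I have a data frame that looks like this. The dataset has been standardized with StandardScaler, and dummy variables have been added for all categorical variables. It has now been split into training and test sets.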

            amt    gender   city_pop    birth_year  distance        
153118  -0.786537   0.0    -0.318571    0.913779    -0.400876   
153226  -0.488455   0.0    -0.322397    0.741579     1.384297   
153228  0.437970    0.0    -0.329167    1.774776    -0.658839   
153303  -0.877627   0.0    -0.329656    1.258177    -1.100713   
153313  0.462143    1.0    -0.313817    1.372977     0.038791

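I am now trying to build some models on this data (logistic regression, decision tree, and random forest) using RandomOverSampler together with StratifiedKFold cross-validation. I am doing this because the minority class of my target variable makes up only 0.3% of the rows.

I have already built the models on the imbalanced data, and they run fine. But when I try to apply the resampling, I get the error below. My code is included as well.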

import pandas as pd
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

skf = StratifiedKFold(n_splits=5, random_state=None)

for fold, (train_index, test_index) in enumerate(skf.split(X, y), 1):
    X_train = X.reindex(index=train_index)
    y_train = y.reindex(index=train_index)
    X_test = X.reindex(index=test_index)
    y_test = y.reindex(index=test_index)
    ROS = RandomOverSampler(sampling_strategy=0.5)
    X_over, y_over = ROS.fit_resample(X_train, y_train)

# Create DataFrame for X_over
X_over = pd.DataFrame(data=X_over, columns=X_train.columns)

I get the following error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-90-372645e869d1> in <module>
      4 oversample = RandomOverSampler(sampling_strategy=1)
      5 # fit and apply the transform
----> 6 X_over,y_over = oversample.fit_resample(X_train,y_train)

~\anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self,X,y)
     73             The corresponding label of `X_resampled`.
     74         """
---> 75         check_classification_targets(y)
     76         arrays_transformer = ArraysTransformer(X,y)
     77         X,y,binarize_y = self._check_X_y(X,y)

~\anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
    178     y : array-like
    179     """
--> 180     y_type = type_of_target(y)
    181     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    182                       'multilabel-indicator', 'multilabel-sequences']:

~\anaconda3\lib\site-packages\sklearn\utils\multiclass.py in type_of_target(y)
    301     if y.dtype.kind == 'f' and np.any(y != y.astype(int)):
    302         # [.1,.2,3] or [[.1,3]] or [[1.,.2]] and not [1.,2.,3.]
--> 303         _assert_all_finite(y)
    304         return 'continuous' + suffix
    305 

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X,allow_nan,msg_dtype)
    104                     msg_err.format
    105                     (type_err,
--> 106                      msg_dtype if msg_dtype is not None else X.dtype)
    107             )
    108     # for object dtype data,we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Solution

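It would be easier to answer after seeing the data. That said, I suggest doing the oversampling before the cross-validation step. Please give that a try.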