How can I use a dictionary with SMOTE to resample each class of multi-class input data differently?


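I want to oversample my data in Python with the SMOTE algorithm from imblearn.over_sampling. My input data has four target classes. I do not want to oversample every minority class up to the size of the majority class; I want to oversample each minority class by a different amount.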

When I use SMOTE(sampling_strategy = 1,k_neighbors=2,random_state = 1000), I get the following error:

ValueError: "sampling_strategy" can be a float only when the type of target is binary. For multi-class,use a dict.

Following the error message, I then passed a dictionary for "sampling_strategy", as follows:

SMOTE(sampling_strategy={'1.0':70,'3.0':255,'2.0':50,'0.0':150},random_state = 1000)

However, this raised the following error:

ValueError: The {'2.0','1.0','0.0','3.0'} target class is/are not present in the data.

Does anyone know how to define the dictionary so that SMOTE oversamples each class by a different amount?

Solution

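You have to specify the desired number of samples for each class and pass this dictionary to the SMOTE object. The keys must be the class labels exactly as they appear in the target array, and because SMOTE only adds samples, each value must be at least that class's current count.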

Code:

import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

# Toy data: 200 samples with 13 integer features
x1 = np.random.randint(500, size=(200, 13))
# Four imbalanced classes with 100, 65, 25 and 10 samples
y1 = np.concatenate([np.array([0]*100), np.array([1]*65), np.array([2]*25), np.array([3]*10)])
np.random.shuffle(y1)
Counter(y1)

Output:

Counter({0: 100,1: 65,2: 25,3: 10})

Code:

# Keys are the actual integer labels; values are the desired sample counts after resampling
sm = SMOTE(sampling_strategy={0: 100, 1: 70, 2: 90, 3: 40})
X_res, y_res = sm.fit_resample(x1, y1)
Counter(y_res)

Output:

Counter({0: 100,2: 90,1: 70,3: 40})
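If it is more convenient to express each target relative to the current class size, the dictionary can be derived from the class counts. The following is only a minimal sketch under assumptions: `targets_from_factors` and its `factors` argument are hypothetical names, not part of imbalanced-learn. Targets are rounded up, and classes that are not listed keep their current size.

```python
import math
from collections import Counter


def targets_from_factors(y, factors):
    """Build a {label: n_samples} dict by scaling each class's current count.

    Hypothetical helper, not part of imbalanced-learn. `factors` maps a label
    to a multiplier >= 1; labels that are not listed keep their current size.
    """
    counts = Counter(y)
    targets = {}
    for label, count in counts.items():
        factor = factors.get(label, 1.0)
        if factor < 1.0:
            # SMOTE only adds samples, so a target below the current count is rejected here.
            raise ValueError(f"factor for class {label!r} must be >= 1")
        targets[label] = math.ceil(count * factor)
    return targets


# Same class sizes as in the answer's toy data
y = [0] * 100 + [1] * 65 + [2] * 25 + [3] * 10
strategy = targets_from_factors(y, {1: 1.5, 2: 3, 3: 4})
print(strategy)
```

The resulting dictionary can then be passed as sampling_strategy to SMOTE in the same way as the hand-written one above.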

For more information, see the imbalanced-learn documentation for SMOTE.

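The error you received occurs because the labels specified in the dictionary do not match the actual labels in the data. The keys in your dictionary are strings such as '1.0', whereas the labels in your target array are presumably numeric, and a string key never equals a numeric label.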
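One way to avoid this mismatch is to take the dictionary keys from the labels actually found in the target array instead of typing them by hand. The sketch below is an illustration only: `match_strategy_keys` is a hypothetical helper, not part of imbalanced-learn, and it assumes the labels are numeric (floats here, which is a guess about the data in the question).

```python
from collections import Counter


def match_strategy_keys(y, strategy):
    """Return a copy of `strategy` whose keys are the labels present in `y`.

    Hypothetical helper, not part of imbalanced-learn. Assumes numeric labels.
    """
    present = list(Counter(y))  # distinct labels, keeping their original type
    fixed = {}
    for key, n_samples in strategy.items():
        # Compare numerically so that '1.0', 1 and 1.0 all refer to the same class.
        matches = [label for label in present if float(label) == float(key)]
        if not matches:
            raise ValueError(f"class {key!r} is not present in the data")
        fixed[matches[0]] = n_samples
    return fixed


# Assumed numeric labels; the string keys are the ones from the question
y = [0.0] * 100 + [1.0] * 65 + [2.0] * 25 + [3.0] * 10
wanted = {'1.0': 70, '3.0': 255, '2.0': 50, '0.0': 150}
strategy = match_strategy_keys(y, wanted)
print(strategy)
```

Checking `Counter(y)` (or `set(y)`) before building the dictionary shows the exact label values and types that the keys need to match.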