
Why does my AdaBoost implementation repeat the same two splits over and over?


I tried to write the AdaBoost algorithm myself, using decision stumps and the Gini index for the splits. The code is below:

import numpy as np

class AdaBoost:
    def main(x,y):
        # set the weights to 1/n
        weights=np.empty(x.shape[0])
        weights.fill(1/x.shape[0])
        
        classifier_weights=[]
        features=[]
        thresholds=[]

        for i in range(4):
            # create decision stump
            feature,threshold=AdaBoost.find_split(x,y,weights)
            features.append(feature)
            thresholds.append(threshold)
        
            # evaluate decision stump
            final_say,predictions=AdaBoost.evaluate_classifier(x,y,weights,feature,threshold)
            classifier_weights.append(final_say)
        
            # adjust sample weights
            weights=AdaBoost.adjust_sample_weights(x,y,weights,final_say,predictions)

        # classification
        classification=0
        predictions=[]
        for i in range(x.shape[0]):
            for j in range(len(features)):
                if x[i,features[j]]>thresholds[j]:
                    y_hat=1
                else:
                    y_hat=-1
                classification+=classifier_weights[j]*y_hat
            if classification>0:
                predictions.append(1)
            else:
                predictions.append(-1)
            classification=0
        
        # classification accuracy
        correct=0
        for i in range(x.shape[0]):
            if predictions[i]==y[i]:
                correct+=1
        accuracy=correct/x.shape[0]
        
        return accuracy,features,thresholds

    def find_split(x,y,weights):
        gini=[]
        feature_gini=[]
        thresholds=[]
        for i in range(x.shape[1]): # cycle through features
            for j in range(x.shape[0]): # cycle through all values and evaluate as thresholds
                gini.append(AdaBoost.evaluate_num_split(x,y,weights,i,x[j,i]))
            feature_gini.append(min(gini))
            thresholds.append(gini.index(min(gini)))
            gini=[]
        # feature_gini is a list containing the lowest gini value for every feature
        # therefore, the split occurs on the feature with the lowest min gini
        feature=feature_gini.index(min(feature_gini))
        # we also need to know which threshold led to this lowest gini
        threshold=x[thresholds[feature],feature]
        return feature,threshold
                      
    def evaluate_num_split(x,y,weights,feature,threshold): # evaluate split with numeric values
        ye=[]
        nah=[]
        
        # loop puts index of samples into corresponding lists so that
        # we can access both x and y via the index
        for i in range(x.shape[0]):
            if x[i,feature]>threshold:
                ye.append(i)
            else:
                nah.append(i)
        
        return AdaBoost.evaluate_gini(ye,nah,y,weights)
     
    def evaluate_gini(ye,nah,y,weights):
        # evaluate Gini index
        weights_ye=0
        weights_nah=0
        corr=0
        wrong=0
        
        # determine weights for yes and no
        for i in ye:
            weights_ye+=weights[i]
        for i in nah:
            weights_nah+=weights[i]
        
        # prevent an error for dividing by zero later
        if weights_ye==0:
            return 100
        elif weights_nah==0:
            return 100
        else:
            pass
        
        # determine gini for 'yes' leaf
        for i in ye:
            if y[i]==1:
                corr+=weights[i]
            else:
                wrong+=weights[i]
        gini_ye=1-(corr/weights_ye)**2-(wrong/weights_ye)**2
        
        # determine gini for 'no' leaf
        corr=0
        wrong=0
        for i in nah:
            if y[i]==1:
                corr+=weights[i]
            else:
                wrong+=weights[i]
        gini_nah=1-(corr/weights_nah)**2-(wrong/weights_nah)**2
        
        return weights_ye*gini_ye+weights_nah*gini_nah # return weighted gini between both leaves
        
    def evaluate_classifier(x,y,weights,feature,threshold):
        total_error=0
        predictions=[]
        for i in range(x.shape[0]):
            if x[i,feature]>threshold:
                y_hat=1
            else:
                y_hat=-1
            if y_hat!=y[i]:
                total_error+=weights[i]
            else:
                pass
            predictions.append(y_hat)
        return 0.5*np.log((1-total_error)/total_error),predictions
    
    def adjust_sample_weights(x,y,weights,final_say,predictions):
        summation=0
        for i in range(x.shape[0]):
            weights[i]=weights[i]+np.exp(-y[i]*predictions[i]*final_say)
        for i in weights:
            summation+=i
        for i in range(x.shape[0]):
            weights[i]=weights[i]/summation
        return weights
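
The weighted Gini computation in `evaluate_gini` can also be written compactly with NumPy boolean masks, which avoids the index-list bookkeeping. A standalone sketch (`weighted_gini` and `mask` are illustrative names, not from the code above):

```python
import numpy as np

def weighted_gini(y, weights, mask):
    """Weighted Gini impurity of a binary stump split.

    y contains labels in {-1, +1}, weights should sum to 1, and mask is a
    boolean array marking the samples that fall into the 'yes' leaf.
    Returns the weight-weighted sum of the two leaf impurities.
    """
    total = 0.0
    for side in (mask, ~mask):
        w_side = weights[side].sum()
        if w_side == 0:          # empty leaf contributes nothing
            continue
        p_pos = weights[side][y[side] == 1].sum() / w_side
        total += w_side * (1 - p_pos**2 - (1 - p_pos)**2)
    return total

x_feat = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([-1, -1, 1, 1])
w = np.full(4, 0.25)
print(weighted_gini(y, w, x_feat > 2.5))  # perfect split -> 0.0
```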

I'm using a diabetes dataset with 12 features and over 400 samples. After creating the first decision stump, the accuracy is 60%. The second stump also works fine and raises the accuracy to 68%. However, the subsequent stumps repeat these same two splits over and over: they split on the same features with the same thresholds. Maybe I'm adjusting the sample weights incorrectly. I spent a whole day troubleshooting the code and found a few errors, but I can't pin down the problem here.
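For reference, the textbook AdaBoost re-weighting step is multiplicative: each sample weight is scaled by exp(-α·y_i·h(x_i)) and the weights are then renormalized to sum to one, so misclassified samples gain weight relative to correctly classified ones. A standalone sketch of that step (the helper name `update_weights` is illustrative):

```python
import numpy as np

def update_weights(weights, y, predictions, alpha):
    """Standard AdaBoost weight update: multiplicative scaling, then renormalize."""
    w = weights * np.exp(-alpha * y * predictions)  # y, predictions in {-1, +1}
    return w / w.sum()

w = np.full(4, 0.25)
y = np.array([1, -1, 1, -1])
pred = np.array([1, -1, -1, -1])   # third sample misclassified
w_new = update_weights(w, y, pred, alpha=0.5)
# after the update, w_new sums to 1 and w_new[2] is the largest weight
```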

Calling the main function produces the output below. The return values are the classification accuracy on the training set, the list of features the decision stumps split on, and the list of thresholds chosen for the splits:

AdaBoost.main(x,y)

(0.6780045351473923,[5,2,5,2],[44.0,88.0,44.0,88.0])

P.S.: I'm sure the code is messier and more complicated than it needs to be for what I'm trying to do here. I'm new to coding, so please tell me if I'm systematically doing something overly complicated or wrong.

Thank you very much for reading through all this messy code. I've commented it as best I can.
