Why does my AdaBoost implementation keep repeating the same two splits?
I tried writing the AdaBoost algorithm myself, using decision stumps and the Gini index to choose the splits. The code is below:
import numpy as np

class AdaBoost:

    @staticmethod
    def main(x,y):
        # set the weights to 1/n
        weights=np.empty(x.shape[0])
        weights.fill(1/x.shape[0])
        classifier_weights=[]
        features=[]
        thresholds=[]
        for i in range(4):
            # create decision stump
            feature,threshold=AdaBoost.find_split(x,y,weights)
            features.append(feature)
            thresholds.append(threshold)
            # evaluate decision stump
            final_say,predictions=AdaBoost.evaluate_classifier(x,y,weights,feature,threshold)
            classifier_weights.append(final_say)
            # adjust sample weights
            weights=AdaBoost.adjust_sample_weights(x,y,weights,final_say,predictions)
        # classification
        classification=0
        predictions=[]
        for i in range(x.shape[0]):
            for j in range(len(features)):
                if x[i,features[j]]>thresholds[j]:
                    y_hat=1
                else:
                    y_hat=-1
                classification+=classifier_weights[j]*y_hat
            if classification>0:
                predictions.append(1)
            else:
                predictions.append(-1)
            classification=0
        # classification accuracy
        correct=0
        for i in range(x.shape[0]):
            if predictions[i]==y[i]:
                correct+=1
        accuracy=correct/x.shape[0]
        return accuracy,features,thresholds

    @staticmethod
    def find_split(x,y,weights):
        gini=[]
        feature_gini=[]
        thresholds=[]
        for i in range(x.shape[1]): # cycle through features
            for j in range(x.shape[0]): # cycle through all values and evaluate as thresholds
                gini.append(AdaBoost.evaluate_num_split(x,y,weights,i,x[j,i]))
            feature_gini.append(min(gini))
            thresholds.append(gini.index(min(gini)))
            gini=[]
        # feature_gini is a list containing the lowest gini value for every feature
        # therefore, the split occurs on the feature with the lowest min gini
        feature=feature_gini.index(min(feature_gini))
        # we also need to know which threshold led to this lowest gini
        threshold=x[thresholds[feature],feature]
        return feature,threshold

    @staticmethod
    def evaluate_num_split(x,y,weights,feature,threshold): # evaluate split with numeric values
        ye=[]
        nah=[]
        # loop puts index of samples into corresponding lists so that
        # we can access both x and y via the index
        for i in range(x.shape[0]):
            if x[i,feature]>threshold:
                ye.append(i)
            else:
                nah.append(i)
        return AdaBoost.evaluate_gini(ye,nah,x,y,weights)

    @staticmethod
    def evaluate_gini(ye,nah,x,y,weights):
        # evaluate Gini index
        weights_ye=0
        weights_nah=0
        corr=0
        wrong=0
        # determine weights for yes and no
        for i in ye:
            weights_ye+=weights[i]
        for i in nah:
            weights_nah+=weights[i]
        # prevent an error for dividing by zero later
        if weights_ye==0:
            return 100
        elif weights_nah==0:
            return 100
        else:
            pass
        # determine gini for 'yes' leaf
        for i in ye:
            if y[i]==1:
                corr+=weights[i]
            else:
                wrong+=weights[i]
        gini_ye=1-(corr/weights_ye)**2-(wrong/weights_ye)**2
        # determine gini for 'no' leaf
        corr=0
        wrong=0
        for i in nah:
            if y[i]==1:
                corr+=weights[i]
            else:
                wrong+=weights[i]
        gini_nah=1-(corr/weights_nah)**2-(wrong/weights_nah)**2
        return weights_ye*gini_ye+weights_nah*gini_nah # return weighted gini between both leaves

    @staticmethod
    def evaluate_classifier(x,y,weights,feature,threshold):
        total_error=0
        predictions=[]
        for i in range(x.shape[0]):
            if x[i,feature]>threshold:
                y_hat=1
            else:
                y_hat=-1
            if y_hat!=y[i]:
                total_error+=weights[i]
            else:
                pass
            predictions.append(y_hat)
        return 0.5*np.log((1-total_error)/total_error),predictions

    @staticmethod
    def adjust_sample_weights(x,y,weights,final_say,predictions):
        summation=0
        for i in range(x.shape[0]):
            weights[i]=weights[i]+np.exp(-y[i]*predictions[i]*final_say)
        for i in weights:
            summation+=i
        for i in range(x.shape[0]):
            weights[i]=weights[i]/summation
        return weights
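I used a diabetes dataset with 12 features and over 400 samples. After the first decision stump is created, the accuracy is 60%. The second stump also works as expected and raises the accuracy to 68%. After that, however, the stumps just repeat those same two splits over and over: they split on the same features with the same thresholds. Maybe I am adjusting the sample weights incorrectly. I spent a whole day troubleshooting the code and found a few errors, but I cannot find the problem here.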
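On the suspicion about the weight adjustment: in the standard AdaBoost formulation, each sample weight is multiplied by exp(-final_say * y_i * prediction_i) and the weights are then renormalised, whereas adjust_sample_weights in the code above adds that factor to the old weight. The following is a minimal sketch of the multiplicative update on toy data, not a verified fix for this dataset; the name update_sample_weights and the toy arrays are hypothetical and not part of the original code.

```python
import numpy as np

def update_sample_weights(weights, y, predictions, final_say):
    """Multiplicative AdaBoost update followed by normalisation (hypothetical helper)."""
    weights = np.asarray(weights, dtype=float)
    y = np.asarray(y)
    predictions = np.asarray(predictions)
    # Misclassified samples (y * prediction == -1) are scaled up,
    # correctly classified samples (y * prediction == +1) are scaled down.
    new_weights = weights * np.exp(-final_say * y * predictions)
    return new_weights / new_weights.sum()

# Toy example (made-up data): four samples, the second one is misclassified.
y = np.array([1, 1, -1, -1])
predictions = np.array([1, -1, -1, -1])
weights = np.full(4, 0.25)

total_error = weights[predictions != y].sum()
final_say = 0.5 * np.log((1 - total_error) / total_error)
new_weights = update_sample_weights(weights, y, predictions, final_say)
```

With this form and final_say computed as in evaluate_classifier, the samples that the current stump misclassified end up holding half of the total weight, which tends to push the next stump toward a different split.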
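Calling the main function produces the output below. It returns the classification accuracy on the training set, the list of features the decision stumps split on, and the list of thresholds chosen for those splits: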
AdaBoost.main(x,y)
(0.6780045351473923,[5,2,5,2],[44.0,88.0,44.0,88.0])
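One possible reading of the alternating pattern [5,2,5,2] is that the sample weights depend almost entirely on the most recent stump and carry very little history. The sketch below, on made-up random data, compares an additive update (as in adjust_sample_weights above) with a multiplicative one, starting from two different weight histories. The function names, the value of final_say, and the data are all assumptions chosen for illustration.

```python
import numpy as np

def additive_update(weights, y, predictions, final_say):
    # Mirrors the update in the posted code: old weight PLUS the exponential term.
    w = weights + np.exp(-final_say * y * predictions)
    return w / w.sum()

def multiplicative_update(weights, y, predictions, final_say):
    # Standard AdaBoost form: old weight TIMES the exponential term.
    w = weights * np.exp(-final_say * y * predictions)
    return w / w.sum()

rng = np.random.default_rng(0)
n = 400
y = rng.choice([-1, 1], size=n)
# A made-up stump that is right on roughly 70% of the samples.
predictions = np.where(rng.random(n) < 0.7, y, -y)
final_say = 0.4  # arbitrary illustrative value

# Two different weight histories that both sum to 1.
history_a = np.full(n, 1 / n)
history_b = rng.random(n)
history_b /= history_b.sum()

# Largest per-sample difference between the updated weights of the two histories.
gap_additive = np.abs(
    additive_update(history_a, y, predictions, final_say)
    - additive_update(history_b, y, predictions, final_say)
).max()
gap_multiplicative = np.abs(
    multiplicative_update(history_a, y, predictions, final_say)
    - multiplicative_update(history_b, y, predictions, final_say)
).max()

print(gap_additive, gap_multiplicative)
```

In this toy setting the additive update gives almost the same weights regardless of the starting history, because the exponential terms are close to 1 while the old weights are on the order of 1/n. If the same thing happens in the real run, the weights after each round reflect only the last stump's mistakes, which would be consistent with the splits alternating between two stumps.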
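P.S.: I am sure the code is messier and more complicated than it needs to be for what I am trying to do. I am new to coding, so please tell me if anything I am doing is systematically overcomplicated or wrong.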
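On the P.S.: the nested loops used for the final classification and the accuracy count can be written with NumPy array operations. This is a minimal sketch, assuming x is a 2-D NumPy array and the labels are -1/+1; ensemble_predict is a hypothetical helper name and the toy values are made up.

```python
import numpy as np

def ensemble_predict(x, features, thresholds, classifier_weights):
    """Weighted vote of decision stumps (hypothetical helper, same rule as the loops in main)."""
    x = np.asarray(x, dtype=float)
    # One column of -1/+1 votes per stump, shape (n_samples, n_stumps).
    votes = np.where(x[:, features] > np.asarray(thresholds), 1, -1)
    # Weighted sum of the votes for every sample.
    scores = votes @ np.asarray(classifier_weights)
    return np.where(scores > 0, 1, -1)

# Toy data: three samples, two features, two stumps.
x = np.array([[1.0, 5.0],
              [3.0, 2.0],
              [4.0, 7.0]])
y = np.array([-1, 1, 1])

predictions = ensemble_predict(
    x, features=[0, 1], thresholds=[2.0, 6.0], classifier_weights=[0.8, 0.3]
)
accuracy = np.mean(predictions == y)
```

The comparison x[:, features] > thresholds relies on NumPy broadcasting, so each selected column is compared against its own threshold in one step.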