
Building a decision tree classifier

How can I build a classification decision tree?

I have two data sets, partb_data1 and partb_data2. They contain samples of bank customers, with features describing each customer and whether the bank kept working with them (churn). The Exited column records the churn (1 if the customer left the bank, 0 if they stayed with the bank). I use partb_data1 as the training set and partb_data2 as the test set.

Here is my data:

> dput(head(partb_data1))
structure(list(RowNumber = 1:6,
    CustomerId = c(15634602L, 15647311L, 15619304L, 15701354L, 15737888L, 15574012L),
    Surname = c("Hargrave", "Hill", "Onio", "Boni", "Mitchell", "Chu"),
    Creditscore = c(619L, 608L, 502L, 699L, 850L, 645L),
    Geography = c("France", "Spain", "France", "Spain"),
    Gender = c("Female", "Female", "Male"),
    Age = c(42L, 41L, 42L, 39L, 43L, 44L),
    Tenure = c(2L, 1L, 8L, 2L, 8L),
    Balance = c(0, 83807.86, 159660.8, 125510.82, 113755.78),
    NumOfProducts = c(1L, 3L, 2L),
    HasCrCard = c(1L, 0L, 1L),
    IsActiveMember = c(1L, 0L),
    EstimatedSalary = c(101348.88, 112542.58, 113931.57, 93826.63, 79084.1, 149756.71),
    Exited = c(1L, 1L)),
    row.names = c(NA, 6L), class = "data.frame")



> dput(head(partb_data2))
structure(list(RowNumber = 8001:8006,
    CustomerId = c(15629002L, 15798053L, 15753895L, 15595426L, 15645815L, 15632848L),
    Surname = c("Hamilton", "Nnachetam", "Blue", "Madukwe", "Mills", "Ferrari"),
    Creditscore = c(747L, 707L, 590L, 603L, 615L, 634L),
    Geography = c("Germany", "France"),
    Gender = c("Male", "Male", "Female"),
    Age = c(36L, 32L, 37L, 57L, 45L, 36L),
    Tenure = c(8L, 9L, 6L, 5L,
    Balance = c(102603.3, 105000.85, 69518.95),
    NumOfProducts = c(2L,
    EstimatedSalary = c(180693.61, 126475.79, 133535.99, 87412.24, 164886.64, 116238.39),
    Exited = c(0L, 0L)), class = "data.frame")

I built a classification tree to predict churn. Here is my code:

library(tidyverse)
library(caret)
library(rpart)
library(rpart.plot)

# Split the data into training and test set
train.data <- head(partb_data1, 500)
test.data <- tail(partb_data2, 150)

# Build the model
modelb <- rpart(Exited ~ ., data = train.data, method = "class")
# Visualize the decision tree with rpart.plot
rpart.plot(modelb)

# Make predictions on the test data
predicted.classes <- modelb %>%
  predict(test.data, type = "class")
head(predicted.classes)

# Compute model accuracy rate on test data
mean(predicted.classes == test.data$Exited)
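
As a side note, I understand caret's confusionMatrix() can give a fuller breakdown than the raw accuracy; this is just a rough sketch, assuming predicted.classes and test.data$Exited use the same 0/1 coding as above:

# Sketch: confusion matrix for the test-set predictions.
# confusionMatrix() expects two factors with identical levels,
# so the observed Exited values are converted here.
confusionMatrix(
  data      = predicted.classes,
  reference = factor(test.data$Exited, levels = levels(predicted.classes))
)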

### Pruning the tree:

# Fit the model on the training set
modelb2 <- train(
  Exited ~ ., data = train.data, method = "rpart",
  trControl = trainControl("cv", number = 10), tuneLength = 10
)
# Plot model accuracy vs different values of
# cp (complexity parameter)
plot(modelb2)

# Print the best tuning parameter cp that
# maximizes the model accuracy
modelb2$bestTune

# Plot the final tree model
plot(modelb2$finalModel)

# Make predictions on the test data
predicted.classes <- modelb2 %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$Exited)
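
As far as I understand, rpart can also be pruned directly from its own cross-validated complexity table instead of re-tuning cp with caret. A rough sketch based on the modelb fit above (best.cp and modelb.pruned are just names I picked here):

# Sketch: prune modelb at the cp value with the lowest cross-validated error.
printcp(modelb)                                       # cp table with xerror per split
best.cp <- modelb$cptable[which.min(modelb$cptable[, "xerror"]), "CP"]
modelb.pruned <- prune(modelb, cp = best.cp)
rpart.plot(modelb.pruned)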

Note: I built the test set from partb_data2.

Is the procedure I followed correct? Do I need to make any changes to reach my goal of a classification tree? Any help is very welcome!

EDITED!!!

Solution

Your head(partb_data1$Exited, 500) is not a data.frame. Because of the $, you are taking a subset of only the Exited column of partb_data1, not of the data frame itself. It is just an integer vector, so that won't work.

class(head(partb_data1$Exited, 500))
[1] "integer"
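
A quick sketch of the difference, using the same objects:

# Subsetting the data frame keeps a data.frame; $ extracts a single column.
class(head(partb_data1, 500))         # "data.frame"  (all columns, first 500 rows)
class(partb_data1[1:500, ])           # "data.frame"  (equivalent row subset)
class(head(partb_data1$Exited, 500))  # "integer"     (only the Exited column)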


There are always many options for how to proceed.

But separating your data into a training set and a test set is correct. You could also use cross-validation instead. You are using cross-validation on the training set, which is usually not necessary, but it is possible.

I think using your complete data for CV should also work, but what you did is not wrong.
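
If you want to try that, here is a rough sketch; I am assuming partb_data1 and partb_data2 have identical columns, I drop the ID-like columns (RowNumber, CustomerId, Surname) to keep the formula manageable, and I convert Exited to a factor so caret treats the problem as classification (full.data and cv.model are just placeholder names):

library(caret)
library(dplyr)

# Sketch: 10-fold cross-validation of the rpart tree on the combined data.
full.data <- rbind(partb_data1, partb_data2) %>%
  select(-RowNumber, -CustomerId, -Surname)
full.data$Exited <- factor(full.data$Exited, levels = c(0, 1))

cv.model <- train(
  Exited ~ ., data = full.data, method = "rpart",
  trControl = trainControl(method = "cv", number = 10), tuneLength = 10
)
cv.model$results   # cross-validated accuracy for each candidate cp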
