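How to address overfitting when resampling data and using k-fold cross-validation
I want to reduce overfitting in my model. During feature selection I ran a multicollinearity test to exclude features from the model. Now I need to apply k-fold cross-validation. This is a text classification problem, specifically detecting spam vs. non-spam email. I extracted several features; for simplicity I refer to them here as categorical, numerical, and text. I did the following: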
# Definition of X and y
import pandas as pd
from sklearn.utils import resample
X = df[text_feature + categorical_features + numerical_features]
y = df[['Label']]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
# Applying downsampling
# Separating classes
def downsampling(data):
    spam = data[data.Label == 1]
    not_spam = data[data.Label == 0]
    # Downsampling the majority
    downsample = resample(spam, replace=True, n_samples=len(not_spam), random_state=42)
    # Returning to new training set
    downsample_train = pd.concat([not_spam, downsample])
    return downsample_train
# X_train has no Label column, so rejoin the labels before resampling
downsample_train = downsampling(pd.concat([X_train, y_train], axis=1))
train_df = downsample_train.copy()
test_df = pd.concat([X_test, y_test], axis=1)
# Creating the Bag of Words model and apply other pre-processors
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
categorical_preprocessing = OneHotEncoder(handle_unknown='ignore')
numeric_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])
# CountVectorizer
text_preprocessing_cv = Pipeline(steps=[
    ('CV', CountVectorizer())
])
# TF-IDF
text_preprocessing_tfidf = Pipeline(steps=[
    ('TF-IDF', TfidfVectorizer())
])
preprocessing = ColumnTransformer(
    transformers=[
        ('text', text_preprocessing_cv, 'Text'),
        ('category', categorical_preprocessing, categorical_features),
        ('numeric', numeric_preprocessing, numerical_features)
    ], remainder='passthrough')
clf_lr = Pipeline(steps=[('preprocessor', preprocessing), ('classifier', LogisticRegression())])
pipelines(clf_lr, X_train, X_test)
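For reference, here is a minimal, self-contained sketch of the same kind of preprocessing pipeline on toy data. The data frame, the choice of `Address` as the categorical column and `Year` as the numeric column, and `max_iter=1000` are assumptions made only for illustration. One detail worth noting is that `CountVectorizer` expects a 1-D sequence of strings, so the text column is selected with a scalar string (`"Text"`), while the other transformers receive lists of column names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the data from the question.
toy = pd.DataFrame({
    "Text": ["you won a prize", "meeting at noon", "win cash now", "see you tomorrow"],
    "Address": ["a@x.com", "b@y.com", "a@x.com", "c@z.com"],
    "Year": [2019, 2020, None, 2012],  # one missing value for the imputer
    "Label": [1, 0, 1, 0],
})

preprocessing = ColumnTransformer(transformers=[
    # Scalar column name -> 1-D input, which CountVectorizer requires.
    ("text", CountVectorizer(), "Text"),
    # List of column names -> 2-D input for the encoder and imputer.
    ("category", OneHotEncoder(handle_unknown="ignore"), ["Address"]),
    ("numeric", SimpleImputer(strategy="mean"), ["Year"]),
])

clf = Pipeline(steps=[
    ("preprocessor", preprocessing),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Features and labels are kept separate: the label column is not passed as a feature.
features = toy.drop(columns="Label")
clf.fit(features, toy["Label"])
predictions = clf.predict(features)
print(predictions)
```

Predicting on the training rows here only checks that the pipeline runs end to end; it says nothing about generalization.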
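Examples of the features I am considering:
- Text (e.g., "You won an amazing price!!!", "Dear John, I hope you are ready for this great news!!!!!!:)", ...)
- Year (e.g., 2019, 2020, ...)
- #_of_characters_Subj (e.g., 34, 67, ...): this value is derived from the subject
- Address (e.g., abc@gmail.com, ghi@yahoo.com, ...)
- Spam (e.g., 1, 0, ...): this is a boolean variable, 1 for spam and 0 for non-spam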
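As an illustration of how a count feature like `#_of_characters_Subj` could be derived, here is a small sketch. The `Subject` column name and the two example subjects are hypothetical; the post does not show the raw subject field.

```python
import pandas as pd

# Hypothetical raw subjects; the real column name is not given in the post.
emails = pd.DataFrame({
    "Subject": ["Congratulations, you are a winner", "Lunch on Friday?"],
})

# Number of characters in each subject line.
emails["#_of_characters_Subj"] = emails["Subject"].str.len()
print(emails)
```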
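As far as I understand, resampling should be applied only to the training set, so that performance is not overestimated. With k=5, k-fold validation should split the data into a training part (4 folds) and a test part (1 fold). I tried to include cross-validation using a function: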
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
def bc_matrix(classifier):
    k_fold = KFold(n_splits=5)
    scores = []
    confusion = np.array([[0, 0], [0, 0]])
    for train_ind, test_ind in k_fold.split(train_df):
        # Train
        train_c = train_df.iloc[train_ind]
        train_y = train_df.iloc[train_ind]['Label']
        # Test
        test_c = train_df.iloc[test_ind]
        test_y = train_df.iloc[test_ind]['Label']
        classifier.fit(train_c, train_y)  # Fit the model
        predictions = classifier.predict(test_c)
        confusion += confusion_matrix(test_y, predictions)
    return (
# K-fold cross validation for each classifier
bc_matrix(clf_lr)
But there is a problem here:
---> 19 classifier.fit(train_feat,train_y) # Fit the model
IndexError: tuple index out of range
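For comparison, the fold-wise procedure described above (resample only the training folds, evaluate on the untouched held-out fold) could be sketched as follows. This is a sketch on hypothetical toy data with a text-only classifier, not a diagnosis of the error above. The helper names `downsample` and `cross_validated_confusion`, the use of `StratifiedKFold` instead of `KFold`, and sampling without replacement are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.utils import resample

# Hypothetical imbalanced toy data: 10 spam rows (1) and 30 non-spam rows (0).
spam_texts = [f"win cash prize now offer {i}" for i in range(10)]
ham_texts = [f"meeting notes project schedule {i}" for i in range(30)]
train_df = pd.DataFrame({
    "Text": spam_texts + ham_texts,
    "Label": [1] * 10 + [0] * 30,
})

def downsample(data, label_col="Label", random_state=42):
    """Shrink every class to the size of the smallest one, without replacement."""
    smallest = data[label_col].value_counts().min()
    parts = [
        resample(group, replace=False, n_samples=smallest, random_state=random_state)
        for _, group in data.groupby(label_col)
    ]
    return pd.concat(parts)

def cross_validated_confusion(classifier, data, n_splits=5):
    """Sum the per-fold confusion matrices; only the training folds are resampled."""
    k_fold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    confusion = np.zeros((2, 2), dtype=np.int64)
    for train_ind, test_ind in k_fold.split(data, data["Label"]):
        fold_train = downsample(data.iloc[train_ind])  # balance the training folds only
        fold_test = data.iloc[test_ind]                # held-out fold keeps its original ratio
        # Features and labels are passed separately to fit and predict.
        classifier.fit(fold_train["Text"], fold_train["Label"])
        predictions = classifier.predict(fold_test["Text"])
        confusion += confusion_matrix(fold_test["Label"], predictions, labels=[0, 1])
    return confusion

clf = Pipeline([("cv", CountVectorizer()), ("lr", LogisticRegression())])
confusion = cross_validated_confusion(clf, train_df)
print(confusion)
```

Passing `labels=[0, 1]` to `confusion_matrix` keeps each per-fold matrix 2x2 even if a fold's predictions contain only one class, so the in-place sum always has matching shapes.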
Data example:
Text                                                         Year  #_of_characters_Subj  Address              Spam
You won an amazing price!!!                                  2019  34                    abc@gmail.com        1
Dear John,I hope you are ready for this great news!!!!!!!:)  2020  67                    ghi@yahoo.com        0
It is awesome                                                2012  56                    yes_we_can@live.com  1
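To make the sample reproducible, the rows above could be rebuilt as a data frame like this. It assumes the Address/Spam rows line up, in order, with the Text/Year/#_of_characters_Subj rows, which the post does not state explicitly.

```python
import pandas as pd

# Sample rows from the post; row alignment across the two column groups is assumed.
sample = pd.DataFrame({
    "Text": [
        "You won an amazing price!!!",
        "Dear John,I hope you are ready for this great news!!!!!!!:)",
        "It is awesome",
    ],
    "Year": [2019, 2020, 2012],
    "#_of_characters_Subj": [34, 67, 56],
    "Address": ["abc@gmail.com", "ghi@yahoo.com", "yes_we_can@live.com"],
    "Spam": [1, 0, 1],
})

# Keep the target separate from the feature columns.
sample_X = sample.drop(columns="Spam")
sample_y = sample["Spam"]
print(sample_X.dtypes)
```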
I would appreciate any help fixing the error so that I can predict on the test set (hopefully cross-validation will help reduce overfitting).