
Static feature order in CatBoostClassifier

How to work around the static feature order in CatBoostClassifier

I fitted a CatBoost model on Kaggle's Titanic dataset:

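import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

train_df = pd.read_csv('input/train.csv')
test_df = pd.read_csv('input/test.csv')

# replace missing values with a sentinel
train_df.fillna(-999, inplace=True)
test_df.fillna(-999, inplace=True)

x = train_df.drop('Survived', axis=1)
y = train_df.Survived

# positional indices of every non-float column, treated as categorical
cate_features_index = np.where(x.dtypes != float)[0]

xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=.85, random_state=1234)

model = CatBoostClassifier(eval_metric='Accuracy', use_best_model=True, random_seed=42)

# the target ytrain must be passed together with the features
model.fit(xtrain, ytrain, cat_features=cate_features_index, eval_set=(xtest, ytest))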

If the object I pass for prediction has its features in the same order as train_df, the code works fine:

{
      "PassengerId": "892", "Pclass": "3", "Name": "Kelly,Mr. James", "Sex": "female", "Age": "34.5", "SibSp": "0", "Parch": "0", "Ticket": "330911", "Fare": "7.8292", "Cabin": "", "Embarked": "Q"
    }

But if I change the order, for example:

{
      "Age": "34.5","PassengerId": "892","Ticket": "330911"
    }

I get this error:

_catboost.CatBoostError: Bad value for num_feature[non_default_doc_idx=0,feature_idx=4]="Kelly,Mr. James": Cannot convert 'b'Kelly,Mr. James'' to float

Is it possible to fit the model so that it does not require a fixed feature order?

Solution

You can specify the categorical features by name instead of by index; in that case their order in the data frame does not matter.
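
For context, the error in the question is consistent with how index-based `cat_features` work: the indices are positions in the training column order, so position 4 (Age) is expected to hold a float. The sketch below uses plain Python lists to show what sits at the numeric positions once the columns are shuffled. The `reordered_columns` list is hypothetical: only its first three names come from the question, and `columns_at` is a helper made up for this illustration.

```python
# Column order the model was trained on (Titanic train.csv without 'Survived').
train_columns = ["PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp",
                 "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

# Positions of the float columns (Age, Fare) in that order; every other
# position was passed to the model as a categorical index.
numeric_positions = [4, 8]

# Hypothetical reordered input: only the first three names are taken from
# the question, the remaining order is assumed for illustration.
reordered_columns = ["Age", "PassengerId", "Ticket", "Pclass", "Name", "Sex",
                     "SibSp", "Parch", "Fare", "Cabin", "Embarked"]


def columns_at(columns, positions):
    """Return the column names found at the given positions."""
    return [columns[i] for i in positions]


print(columns_at(train_columns, numeric_positions))      # ['Age', 'Fare']
print(columns_at(reordered_columns, numeric_positions))  # ['Name', 'Fare']
```

With this assumed order, the passenger name lands in the slot where a float is expected, which matches the "Cannot convert ... to float" message for feature_idx=4. The example that follows passes column names instead of positions.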

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

# import the data
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

# extract the features and target
X = df_train.drop('Survived',axis=1)
y = df_train['Survived']

# extract the names of the categorical features
cat_features = X.columns[np.where(X.dtypes != float)[0]].values.tolist()
print(cat_features)
# ['PassengerId','Pclass','Name','Sex','SibSp','Parch','Ticket','Cabin','Embarked']

# make sure that the categorical features are encoded as strings
X[cat_features] = X[cat_features].astype(str)

# split the data
X_train,X_valid,y_train,y_valid = train_test_split(X,y,train_size=0.85,random_state=1234)

# train the model
model = CatBoostClassifier(eval_metric='Accuracy',use_best_model=True,random_seed=1234)
model.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_valid, y_valid))

print('Best Iteration: {}'.format(model.best_iteration_))
print('Training Accuracy: {:.2%}'.format(model.best_score_['learn']['Accuracy']))
print('Validation Accuracy: {:.2%}'.format(model.best_score_['validation']['Accuracy']))
# Best Iteration: 347
# Training Accuracy: 96.96%
# Validation Accuracy: 85.07%

# generate the model predictions
df_test[cat_features] = df_test[cat_features].astype(str)
y_pred = model.predict(df_test)
print(y_pred)
# [0 0 0 0 0 0 1 0 1 . . . 0 1 0 1 1 0 0 1 0 0 1]
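
As an extra safeguard for single records that arrive as dictionaries in arbitrary key order, one option is to reorder them explicitly before prediction. The following is a minimal sketch in plain Python, not part of the original answer: `align_record` and its `numeric_features` parameter are hypothetical names, and the record reuses the values from the question in a shuffled order.

```python
def align_record(record, feature_names, numeric_features=()):
    """Return the record's values as a list ordered like feature_names.

    Values of numeric_features are converted to float; all other values
    are converted to str, mirroring the astype(str) step used for the
    categorical columns.
    """
    missing = [name for name in feature_names if name not in record]
    if missing:
        raise KeyError("record is missing features: {}".format(missing))
    row = []
    for name in feature_names:
        value = record[name]
        row.append(float(value) if name in numeric_features else str(value))
    return row


# Training column order (Titanic train.csv without 'Survived').
feature_names = ["PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp",
                 "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

# The record from the question, with its keys in a different order.
record = {"Age": "34.5", "PassengerId": "892", "Ticket": "330911",
          "Pclass": "3", "Name": "Kelly,Mr. James", "Sex": "female",
          "SibSp": "0", "Parch": "0", "Fare": "7.8292", "Cabin": "",
          "Embarked": "Q"}

row = align_record(record, feature_names, numeric_features=("Age", "Fare"))
print(row)
```

With a fitted model, the training order should be available as `model.feature_names_`, and the aligned row could be wrapped as `pd.DataFrame([row], columns=feature_names)` before calling `model.predict`. Raising on missing keys is a deliberate choice here so that an incomplete record fails loudly instead of shifting values into the wrong columns.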
