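I am trying to solve Kaggle's Titanic survival problem. This is my first real step in learning machine learning. I have a problem: the Sex column causes an error. The stack trace says could not convert string to float: 'female'. How do you deal with this problem? I don't want a finished solution. I just want a practical way to approach it, because I do need the Sex column to build my model.
Here is my code: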
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)
x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)
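Solution:
> You can encode your categories as numeric values, i.e. convert each level of the category to a distinct number,
or
> dummy code your categories, i.e. turn each level of the category into a separate column whose value is 0 or 1.
In many machine learning applications, factors are easier to work with as dummy codes.
Note that for a 2-level category, encoding to numeric with the methods outlined below is essentially equivalent to dummy coding: every value that is not level 0 must be level 1. In fact, the dummy coding example given below carries redundant information, because each of the 2 classes gets its own column. That is only to illustrate the concept. Typically you would create only n-1 columns, where n is the number of levels, and the omitted level is implied (i.e. create one column for Female, and every 0 value is implied to be Male).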
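To make the n-1 rule concrete for more than two levels, here is a minimal sketch. The three-level `port` column is hypothetical and not part of the original answer; `pd.get_dummies` and its `drop_first` option are the same calls used further down in this answer.

```python
import pandas as pd

# Hypothetical three-level column, used only for illustration.
df = pd.DataFrame({"port": ["C", "Q", "S", "S", "C", "Q"]})

# Full dummy coding: one column per level (n = 3 columns).
full = pd.get_dummies(df["port"], dtype=int)

# Reduced dummy coding: n - 1 = 2 columns; the first level ("C") is dropped.
reduced = pd.get_dummies(df["port"], drop_first=True, dtype=int)

# A row of all zeros in the reduced coding identifies the omitted level.
implied_c = reduced.sum(axis=1) == 0

print(full)
print(reduced)
print(df.loc[implied_c, "port"].unique())
```

Every row of the full coding sums to 1, which is why one of its columns can always be recovered from the others.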
Encoding a category as numbers:
Method 1: pd.factorize is a simple, fast way of encoding to numeric:
For example, if your gender column looks like this:
>>> df
gender
0 Female
1 Male
2 Male
3 Male
4 Female
5 Female
6 Male
7 Female
8 Female
9 Female
df['gender_factor'] = pd.factorize(df.gender)[0]
>>> df
gender gender_factor
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
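`pd.factorize` returns two things, the integer codes and the unique values they stand for, and the example above keeps only the codes. The sketch below keeps both so the codes can be mapped back. The missing value is added here purely for illustration: by default, missing entries get the code -1.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the NaN row is an assumption added for the demo.
df = pd.DataFrame({"gender": ["Female", "Male", np.nan, "Male", "Female"]})

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(df["gender"])
df["gender_factor"] = codes

# Build a code -> label lookup from the uniques and map the codes back.
# The -1 code (missing value) has no entry, so it maps back to NaN.
lookup = dict(enumerate(uniques))
df["gender_restored"] = df["gender_factor"].map(lookup)

print(df)
```

Since most estimators would treat -1 as an ordinary number, it is probably safer to handle missing values explicitly before fitting a model.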
Method 2: another way is to use the category dtype:
df['gender_factor'] = df['gender'].astype('category').cat.codes
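This gives the same output here, because Female both appears first in the column and sorts first alphabetically.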
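The two methods do not always agree: `pd.factorize` numbers the levels in order of first appearance, while `.cat.codes` numbers them in sorted category order. A small sketch with a column that starts with Male (made up for this illustration) shows the difference, and how `sort=True` brings `factorize` in line:

```python
import pandas as pd

# Illustrative column where the first value is not the alphabetically first one.
df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# Order of appearance: Male -> 0, Female -> 1.
by_appearance = pd.factorize(df["gender"])[0]

# Sorted category order: Female -> 0, Male -> 1.
by_sorted = df["gender"].astype("category").cat.codes

# sort=True makes factorize use sorted order as well.
by_sorted_factorize = pd.factorize(df["gender"], sort=True)[0]

print(by_appearance.tolist())
print(by_sorted.tolist())
print(by_sorted_factorize.tolist())
```

This matters mainly when training and test data are encoded separately, since the same label could otherwise end up with different codes.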
Method 3: sklearn.preprocessing.LabelEncoder()
This method comes with some bonuses, such as easy back-transformation:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# Transform the gender column
df['gender_factor'] = le.fit_transform(df.gender)
>>> df
gender gender_factor
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
# Easy to back transform:
df['gender_factor'] = le.inverse_transform(df.gender_factor)
>>> df
gender gender_factor
0 Female Female
1 Male Male
2 Male Male
3 Male Male
4 Female Female
5 Female Female
6 Male Male
7 Female Female
8 Female Female
9 Female Female
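A fitted `LabelEncoder` also remembers its mapping in `classes_` and can be reused on new data. The sketch below, on a made-up frame, shows both, along with what happens for a label it has not seen. Note that the scikit-learn documentation describes `LabelEncoder` as intended for target labels; for feature columns, `OrdinalEncoder` or `OneHotEncoder` may be the better fit.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small illustrative frame.
df = pd.DataFrame({"gender": ["Female", "Male", "Male", "Female"]})

le = LabelEncoder()
df["gender_factor"] = le.fit_transform(df["gender"])

# classes_ holds the labels in sorted order; the position is the code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# The fitted encoder can be reused on new data with the same mapping.
new_codes = le.transform(["Male", "Female"])

# A label that was not seen during fit raises ValueError.
try:
    le.transform(["Unknown"])
    unseen_ok = True
except ValueError:
    unseen_ok = False

print(mapping, new_codes.tolist(), unseen_ok)
```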
Dummy coding:
df.join(pd.get_dummies(df.gender))
gender Female Male
0 Female 1 0
1 Male 0 1
2 Male 0 1
3 Male 0 1
4 Female 1 0
5 Female 1 0
6 Male 0 1
7 Female 1 0
8 Female 1 0
9 Female 1 0
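Depending on the pandas version, the dummy columns may print as True/False instead of 1/0, because recent releases (2.0 and later, as far as I know) default to boolean columns. If integer columns like the ones shown above are wanted, passing `dtype=int` is one way to get them. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Small illustrative frame.
df = pd.DataFrame({"gender": ["Female", "Male", "Male", "Female"]})

# dtype=int forces 0/1 integer columns regardless of the pandas default.
dummies = pd.get_dummies(df["gender"], dtype=int)
encoded = df.join(dummies)

print(encoded)
```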
Note: if you want to omit one column to get a non-redundant dummy coding (see the note at the beginning of this answer), you can use:
df.join(pd.get_dummies(df.gender, drop_first=True))
gender Male
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
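To tie this back to the question, here is a rough sketch of how the Sex column could be encoded before fitting. It is not the asker's code: the rows are made up to mimic the layout of train.csv, and three choices are my own assumptions. `Survived` is left out of the features, since it is the target. `DecisionTreeClassifier` is swapped in for `DecisionTreeRegressor`, since the target is 0/1. `dropna` is restricted to the columns actually used, since dropping rows with any missing value can discard many rows when a column such as Cabin is mostly empty.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up rows that only mimic the layout of the Kaggle train.csv.
train_data = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 3, 2, 3, 1, 2],
    "Sex":      ["male", "female", "female", "female", "male", "male",
                 "male", "female", "female", "male", "female", "male"],
    "Age":      [22, 38, 26, 35, 35, np.nan, 54, 27, 14, 20, 58, 39],
    "Cabin":    [None, "C85", None, "C123", None, None,
                 "E46", None, None, None, "C103", None],
})

features = ["Pclass", "Sex", "Age"]

# Drop rows with missing values only in the columns that are actually used.
data = train_data.dropna(subset=features + ["Survived"]).copy()

# Encode the two-level Sex column as 0/1 so the tree receives numbers.
data["Sex"] = data["Sex"].map({"female": 0, "male": 1})

X = data[features]
y = data["Survived"]

train_x, val_x, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(train_x, train_y)

val_predictions = model.predict(val_x)
print(accuracy_score(val_y, val_predictions))
```

With so few made-up rows the printed accuracy means nothing; the point is only that `fit` no longer fails once Sex is numeric.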