How to convert string values in a DataFrame column into dummy-variable columns
I have the following DataFrame (other columns omitted):
| customer_id | department |
| ----------- | ----------------------------- |
| 11 | ['nail','men_skincare'] |
| 23 | ['nail','fragrance'] |
| 25 | [] |
| 45 | ['skincare','men_fragrance'] |
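I am preprocessing my data to fit a model. I want to turn the department variable into one dummy variable per unique department category (for however many unique departments there are, not only the ones shown here).
I would like to end up with this: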
| customer_id | department | nail | men_skincare | fragrance | skincare | men_fragrance |
| ----------- | ---------- | ---- | ------------ | --------- | -------- | ------------- |
| 11 | ['nail','men_skincare'] | 1 | 1 | 0 | 0 | 0 |
| 23 | ['nail','fragrance'] | 1 | 0 | 1 | 0 | 0 |
| 25 | [] | 0 | 0 | 0 | 0 | 0 |
| 45 | ['skincare','men_fragrance'] | 0 | 0 | 0 | 1 | 1 |
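I tried this link, but when I sliced the column, pandas treated each value as a string and only created one column per character of the string. This is what I used: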
df['1st'] = df['department'].str[0]
df['2nd'] = df['department'].str[1]
df['3rd'] = df['department'].str[2]
df['4th'] = df['department'].str[3]
df['5th'] = df['department'].str[4]
df['6th'] = df['department'].str[5]
df['7th'] = df['department'].str[6]
df['8th'] = df['department'].str[7]
df['9th'] = df['department'].str[8]
df['10th'] = df['department'].str[9]
Then I tried splitting the string and converting it to a list with:
df['new_column'] = df['department'].apply(lambda x: x.split(","))
Then I tried again, and it still only created a column per character.
Any suggestions?
Edit: I found the answer through the link anky sent; specifically, I used this: https://stackoverflow.com/a/29036042
What worked for me:
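# Strip quotes, brackets, and spaces; regex=False treats "[" and "]" as literal characters
df['department'] = df['department'].str.replace("'", '', regex=False).str.replace("]", '', regex=False).str.replace("[", '', regex=False).str.replace(' ', '', regex=False)
df['department'] = df['department'].apply(lambda x: x.split(","))
s = df['department']
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df1 = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
df = pd.merge(df, df1, right_index=True, left_index=True, how='left')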
Solutions
import pandas as pd
You can do this by first applying the explode() and fillna() methods:
data=df.explode('department').fillna('empty')
Now use the crosstab() method:
data=pd.crosstab(data['customer_id'],data['department'])
Since the concat() method gives you an error, use the merge() and drop() methods instead:
data=pd.merge(df.set_index('customer_id'),data,left_index=True,right_index=True).drop(columns=['empty'])
Now if you print data, you will get the desired output.
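The steps above can be put together as one runnable sketch. It assumes the department column already holds real Python lists, rebuilds the sample data from the question, and uses illustrative variable names (exploded, counts, result) that are not part of the original answer.

```python
import pandas as pd

# Sample data from the question, assuming real Python lists in "department"
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        ["nail", "men_skincare"],
        ["nail", "fragrance"],
        [],
        ["skincare", "men_fragrance"],
    ],
})

# One row per (customer, department); an empty list explodes to NaN,
# which is replaced by the placeholder label "empty"
exploded = df.explode("department")
exploded["department"] = exploded["department"].fillna("empty")

# Count how often each department occurs per customer
counts = pd.crosstab(exploded["customer_id"], exploded["department"])

# Attach the counts to the original rows and drop the placeholder column
result = pd.merge(
    df.set_index("customer_id"), counts, left_index=True, right_index=True
).drop(columns=["empty"])

print(result)
```

The placeholder label keeps customer 25 in the crosstab, so that row survives the merge with zeros in every department column.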
Here is a fast binarizer method, based on the link from anky, that uses sklearn's MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
df = pd.DataFrame({'customer_id':{0:11,1:23,2:25,3:45},'department':{0:["'nail'","'men_skincare'"],1:["'nail'","'fragrance'"],2:[''],3:["'skincare'","'men_fragrance'"]}})
mlb = MultiLabelBinarizer()
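# One 0/1 column per label; strip the embedded quotes from the names and drop the '' column produced by the empty row
dummies = pd.DataFrame(mlb.fit_transform(df.department), columns=[c.strip("'") for c in mlb.classes_], index=df.index)
df = df.join(dummies).drop(columns='')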
# customer_id department fragrance men_fragrance men_skincare nail skincare
# 0 11 ['nail','men_skincare'] 0 0 1 1 0
# 1 23 ['nail','fragrance'] 1 0 0 1 0
# 2 25 [] 0 0 0 0 0
# 3 45 ['skincare','men_fragrance'] 0 1 0 0 1
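For comparison, here is a minimal sketch of the same MultiLabelBinarizer idea when the lists hold plain labels with no embedded quotes and the empty row is a genuinely empty list. The sample data and the names encoded, dummies, and result are assumptions made for illustration. In that case there are no quotes to strip and no '' column to drop.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed clean input: real lists of plain labels, [] for "no department"
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        ["nail", "men_skincare"],
        ["nail", "fragrance"],
        [],
        ["skincare", "men_fragrance"],
    ],
})

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 matrix with one column per distinct label;
# an empty list simply becomes a row of zeros
encoded = mlb.fit_transform(df["department"])

dummies = pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)
result = df.join(dummies)

print(result)
```

The column order follows mlb.classes_, which is sorted, so it does not match the order in which the labels first appear in the data.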
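Note: this assumes that the department column of your real data contains actual Python lists rather than strings that look like lists. If the values are actually strings (i.e. type(df.department[0]) outputs str), you need to do this conversion first: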
df.department = df.department.str.strip('[]').str.split(r'\s*,\s*')
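If the strings are valid Python list literals, as they appear to be in the question, another option is to parse them with ast.literal_eval from the standard library, which also removes the quotes around each label. The sketch below is an alternative to the strip/split line above, not part of the original answer; the helper name parse_departments is hypothetical. It also shows why slicing the unparsed strings produced one column per character.

```python
import ast

import pandas as pd

# Assumed input: the "lists" are really strings that look like lists
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        "['nail','men_skincare']",
        "['nail','fragrance']",
        "[]",
        "['skincare','men_fragrance']",
    ],
})

# On strings, .str[0] indexes characters, so every row yields "["
first_char = df["department"].str[0].tolist()


def parse_departments(value):
    """Turn a list-looking string into a real list; leave other values as they are."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value


df["department"] = df["department"].apply(parse_departments)

# On real lists, .str[0] returns the first element (NaN for an empty list)
first_item = df["department"].str[0].tolist()

print(first_char)
print(first_item)
```

ast.literal_eval only evaluates literals, so it is safer than eval, but it raises an exception on strings that are not valid Python literals, which may matter for messy real data.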
Try:
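# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
dummies = pd.get_dummies(df.set_index('customer_id').explode('department'), prefix='', prefix_sep='')
df.merge(dummies.groupby(level=0).sum(), left_on='customer_id', right_index=True)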
Output:
customer_id department fragrance men_fragrance men_skincare nail skincare
0 11 [nail,men_skincare] 0 0 1 1 0
1 23 [nail,fragrance] 1 0 0 1 0
2 25 [] 0 0 0 0 0
3 45 [skincare,men_fragrance] 0 1 0 0 1