Converting string values in a DataFrame column into dummy variable columns

I have the following DataFrame (other columns omitted):

| customer_id | department                    |
| ----------- | ----------------------------- |
| 11          | ['nail','men_skincare']      |
| 23          | ['nail','fragrance']         |
| 25          | []                            |
| 45          | ['skincare','men_fragrance'] |

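I am preprocessing my data to fit a model. I want to convert the department variable into dummy variables, one for each unique department category (however many unique departments there are, not just the ones shown here).

I want a result like this: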

| customer_id | department                    | nail | men_skincare | fragrance | skincare | men_fragrance |
| ----------- | ----------                    | ---- | ------------ | --------- | -------- | ------------- |
| 11          | ['nail','men_skincare']      | 1    | 1            | 0         | 0        | 0             |
| 23          | ['nail','fragrance']         | 1    | 0            | 1         | 0        | 0             |
| 25          | []                            | 0    | 0            | 0         | 0        | 0             |
| 45          | ['skincare','men_fragrance'] | 0    | 0            | 0         | 1        | 1             |

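I tried the approach from this link, but when I flattened the column, it treated each value as a string and only created one column per character in the string. This is what I used: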

df['1st'] = df['department'].str[0]
df['2nd'] = df['department'].str[1]
df['3rd'] = df['department'].str[2]
df['4th'] = df['department'].str[3]
df['5th'] = df['department'].str[4]
df['6th'] = df['department'].str[5]
df['7th'] = df['department'].str[6]
df['8th'] = df['department'].str[7]
df['9th'] = df['department'].str[8]
df['10th'] = df['department'].str[9]

Then I tried splitting the string and converting it to a list with:

df['new_column'] = df['department'].apply(lambda x: x.split(","))

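Then I tried again, and it still only created a column for each character.

Any suggestions?

Edit: I found the answer through the link anky sent. Specifically, I used this: https://stackoverflow.com/a/29036042

What worked for me: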

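# Strip quotes, brackets, and spaces (literal matches, not regex), then split into lists
df['department'] = df['department'].str.replace("'", '', regex=False).str.replace("]", '', regex=False).str.replace("[", '', regex=False).str.replace(' ', '', regex=False)
df['department'] = df['department'].apply(lambda x: x.split(","))
s = df['department']
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df1 = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
df = pd.merge(df, df1, right_index=True, left_index=True, how='left')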

Solutions

import pandas as pd

You can do this with the explode() and fillna() methods:

data=df.explode('department').fillna('empty')

Now use the crosstab() method:

data=pd.crosstab(data['customer_id'],data['department'])

Since the concat() method gives you an error, use the merge() method together with drop():

data=pd.merge(df.set_index('customer_id'),data,left_index=True,right_index=True).drop(columns=['empty'])

Now, if you print data, you will get the desired output.
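The three steps above can be put together in one self-contained sketch. It assumes the department column already holds real Python lists; the sample frame is rebuilt from the table in the question, and the names exploded, counts, and result are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question (assumes real lists, not strings)
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        ["nail", "men_skincare"],
        ["nail", "fragrance"],
        [],
        ["skincare", "men_fragrance"],
    ],
})

# One row per (customer, department); an empty list becomes NaN, filled with a placeholder
exploded = df.explode("department").fillna("empty")

# Count occurrences of each department per customer
counts = pd.crosstab(exploded["customer_id"], exploded["department"])

# Attach the counts to the original rows and drop the placeholder column
result = pd.merge(
    df.set_index("customer_id"), counts, left_index=True, right_index=True
).drop(columns=["empty"])

print(result)
```

Note that the result is indexed by customer_id; calling result.reset_index() would turn it back into an ordinary column.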


Here is a fast binarizer method using sklearn's MultiLabelBinarizer, based on the link from anky:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'customer_id':{0:11,1:23,2:25,3:45},'department':{0:["'nail'","'men_skincare'"],1:["'nail'","'fragrance'"],2:[''],3:["'skincare'","'men_fragrance'"]}})
mlb = MultiLabelBinarizer()

# One indicator column per class; the '' class comes from the empty row and is dropped
df = df.join(pd.DataFrame(
    mlb.fit_transform(df.department),
    columns=[c.strip("'") for c in mlb.classes_],
    index=df.index,
)).drop(columns='')

#   customer_id                     department  fragrance  men_fragrance  men_skincare  nail  skincare
# 0          11       ['nail','men_skincare']          0              0             1     1         0
# 1          23          ['nail','fragrance']          1              0             0     1         0
# 2          25                             []          0              0             0     0         0
# 3          45  ['skincare','men_fragrance']          0              1             0     0         1

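Note: this assumes that the department column in your real data contains actual Python lists, not strings that look like lists. If the values are actually strings (that is, type(df.department[0]) returns str), you need to do this conversion first: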

df.department = df.department.str.strip('[]').str.split(r'\s*,\s*')
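If the strings are valid Python list literals, as they appear to be in the question, another way to do this conversion is to parse each value with ast.literal_eval from the standard library. This is a sketch under that assumption, not part of the original answer; the sample strings below are rebuilt from the question's table.

```python
import ast

import pandas as pd

# Sample data where department holds strings that look like lists
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        "['nail','men_skincare']",
        "['nail','fragrance']",
        "[]",
        "['skincare','men_fragrance']",
    ],
})

# Parse each string into a real list; literal_eval only accepts Python literals
df["department"] = df["department"].apply(ast.literal_eval)

print(type(df["department"][0]))
print(df["department"].tolist())
```

Unlike the strip/split approach, this removes the inner quotes and turns "[]" into a truly empty list, but it raises an error on any value that is not a well-formed literal.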

Try:

# groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
df.merge(pd.get_dummies(df.set_index('customer_id')
                          .explode('department'),prefix='',prefix_sep='').groupby(level=0).sum(),left_on='customer_id',right_index=True)

Output:

   customer_id                 department  fragrance  men_fragrance  men_skincare  nail  skincare
0           11       [nail,men_skincare]          0              0             1     1         0
1           23          [nail,fragrance]          1              0             0     1         0
2           25                         []          0              0             0     0         0
3           45  [skincare,men_fragrance]          0              1             0     0         1
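For reference, the same explode() plus get_dummies() idea can be written as a self-contained sketch, assuming department holds real Python lists. The aggregation goes through groupby(level=0).sum(), which works on pandas versions where sum() no longer accepts a level argument; the names dummies and result are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question (assumes real lists, not strings)
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        ["nail", "men_skincare"],
        ["nail", "fragrance"],
        [],
        ["skincare", "men_fragrance"],
    ],
})

# Explode to one department per row, one-hot encode, then add the rows back up per customer
dummies = (
    pd.get_dummies(
        df.set_index("customer_id").explode("department"),
        prefix="",
        prefix_sep="",
    )
    .groupby(level=0)
    .sum()
)

# Join the indicator columns back onto the original frame
result = df.merge(dummies, left_on="customer_id", right_index=True)

print(result)
```

The empty list for customer 25 explodes to a missing value, which get_dummies() encodes as all zeros, so no placeholder column is needed here.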
