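How to combine column names based on values to create another column
I have a movie dataset with one column per genre, indicating whether each movie belongs to that genre. For example: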
Index Biography Comedy Crime Documentary Drama Family Fantasy
0 0 1 0 0 1 1 0
1 0 1 0 0 0 1 0
2 0 0 0 0 1 0 0
3 0 1 0 0 0 0 0
4 0 1 0 0 1 0 0
5 0 0 1 0 1 0 0
6 0 1 0 0 0 0 0
I want a new column that lists the names of the genres each movie belongs to, separated by spaces or commas:
Index New column
0 Comedy Drama Family
1 Comedy Family
2 Drama
3 Comedy
4 Comedy Drama
5 Crime Drama
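Please share code in R or Python. Thanks for your help.
Solution
Matrix multiplication in Python (this assumes Index has been set as the DataFrame index, so that only the genre columns take part):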
df.dot(df.columns + " ")
gives
Index
0 Comedy Drama Family
1 Comedy Family
2 Drama
3 Comedy
4 Comedy Drama
5 Crime Drama
6 Comedy
To make it more general:
sep = ","
df.dot(df.columns + sep).str.rstrip(sep)
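That is: append the separator to each column name, perform the matrix-vector multiplication, then strip the trailing separator from the right.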
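The trick relies on Python treating string-times-integer as repetition: `"Comedy," * 1` is `"Comedy,"` and `"Comedy," * 0` is the empty string, so the dot product concatenates only the names whose cell is 1. A minimal sketch, assuming the genre columns hold integer 0/1 values and Index is the DataFrame index (the three-row frame below is a hypothetical subset of the question's data):

```python
import pandas as pd

# Hypothetical three-row subset of the question's data, with Index as the index.
df = pd.DataFrame(
    {"Comedy": [1, 1, 0], "Drama": [1, 0, 1], "Family": [1, 1, 0]},
    index=pd.Index([0, 1, 2], name="Index"),
)

sep = ","
# Each cell contributes (column name + sep) repeated 0 or 1 times;
# the dot product concatenates these contributions across each row.
genres = df.dot(df.columns + sep).str.rstrip(sep)
print(genres.tolist())
```

If the flags are stored as booleans or floats, casting with `df.astype(int)` first is probably the safest way to keep the repetition semantics.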
,
df %>%
apply(1,function(x){which(x == 1)}) %>%
lapply(function(x){
paste(names(x),collapse = " ")
}) %>%
unlist() -> df$your_new_column
,
my.movies <- read.table(text = 'Index Biography Comedy Crime Documentary Drama Family Fantasy
0 0 1 0 0 1 1 0
1 0 1 0 0 0 1 0
2 0 0 0 0 1 0 0
3 0 1 0 0 0 0 0
4 0 1 0 0 1 0 0
5 0 0 1 0 1 0 0
6 0 1 0 0 0 0 0',header = T)
library(tidyverse)
my.movies %>%
pivot_longer(!Index,names_to = 'genre') %>%
filter(value !=0) %>%
group_by(Index) %>%
summarise(genre = toString(genre))
#> # A tibble: 7 x 2
#>   Index genre
#>   <int> <chr>
#> 1     0 Comedy, Drama, Family
#> 2     1 Comedy, Family
#> 3     2 Drama
#> 4     3 Comedy
#> 5     4 Comedy, Drama
#> 6     5 Crime, Drama
#> 7     6 Comedy
Created on 2021-05-30 by the reprex package (v2.0.0)
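,
Base R:
# df[-1] drops the Index column so that an Index value of 1 is not mistaken for a genre flag
df$new_col <- apply(df[-1], 1, function(x) paste0(names(x)[x == 1], collapse = ' '))
dplyr: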
library(dplyr)
df %>%
group_by(Index) %>%
summarise(new_col = paste0(names(.[-1])[cur_data() == 1],collapse = ' '))
# Index new_col
# <int> <chr>
#1 0 Comedy Drama Family
#2 1 Comedy Family
#3 2 Drama
#4 3 Comedy
#5 4 Comedy Drama
#6 5 Crime Drama
#7 6 Comedy
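Data
df <- structure(list(Index = 0:6, Biography = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), Comedy = c(1L, 1L, 0L, 1L, 1L, 0L, 1L), Crime = c(0L, 0L, 0L, 0L, 0L, 1L, 0L), Documentary = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), Drama = c(1L, 0L, 1L, 0L, 1L, 1L, 0L), Family = c(1L, 1L, 0L, 0L, 0L, 0L, 0L), Fantasy = c(0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -7L))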
,
Basic Python code:
import pandas as pd

# 'test.csv' is expected to contain the table from the question
df = pd.read_csv('test.csv')

def check_genre(row):
    # Column names are case-sensitive and must match the CSV header exactly
    s = ""
    if row['Biography'] == 1:
        s = s + ' Biography'
    if row['Comedy'] == 1:
        s = s + ' Comedy'
    if row['Crime'] == 1:
        s = s + ' Crime'
    if row['Documentary'] == 1:
        s = s + ' Documentary'
    if row['Drama'] == 1:
        s = s + ' Drama'
    if row['Family'] == 1:
        s = s + ' Family'
    if row['Fantasy'] == 1:
        s = s + ' Fantasy'
    return s.strip()  # drop the leading space

df['genre'] = df.apply(check_genre, axis=1)
print(df)
,
In pandas, you can take the index labels of the values in each row that equal 1, then join them into a string:
df.apply(lambda row: " ".join(row[row == 1].index),axis=1)
# Index
# 0 Comedy Drama Family
# 1 Comedy Family
# 2 Drama
# 3 Comedy
# 4 Comedy Drama
# 5 Crime Drama
# 6 Comedy
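The one-liner above treats every column as a genre flag, so it fits a frame where Index is the DataFrame index. If Index is an ordinary column instead, the row whose Index equals 1 would pick up the label "Index" as well. A small sketch of one way around that, assuming a hypothetical frame with Index as a regular column and an output column named `genres` chosen here for illustration:

```python
import pandas as pd

# Hypothetical subset of the question's data; Index is a regular column here.
df = pd.DataFrame(
    {"Index": [0, 1, 2], "Comedy": [1, 1, 0], "Crime": [0, 0, 0], "Drama": [1, 0, 1]}
)

# Restrict the row-wise join to the genre columns only.
genre_cols = df.columns.drop("Index")
df["genres"] = df[genre_cols].apply(
    lambda row: " ".join(row[row == 1].index), axis=1
)
print(df[["Index", "genres"]])
```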
,
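Posting an answer in R/dplyr.
This assumes "main_df" is your data frame, as shown in the first image.
Pivot the data frame longer so that all genre columns are in tidy format.
Then group_by index, since each index value identifies one movie, and combine the genre names with paste.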
main_df%>%
pivot_longer(cols=-index)%>%
filter(value>0)%>% # filter where movie is part of the genre i.e 1
group_by(index)%>%
mutate(new_genre = paste(name,collapse = ","))%>%
ungroup()%>%
distinct(index,new_genre)-> main_df2
# if you want to merge back to the original data frame use left_join
left_join(main_df,main_df2,by="index")
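For readers working in pandas, the same reshape, filter, group, and join-back steps can be sketched with `melt`, `groupby`, and `merge`. This is an illustrative analogue rather than a translation the answer's author provided; the frame below is a hypothetical subset of the data, and the column names `name`, `value`, and `new_genre` simply mirror the R code:

```python
import pandas as pd

# Hypothetical subset of the question's data with a lowercase "index" column,
# matching the column name used in the R code above.
main_df = pd.DataFrame(
    {"index": [0, 1, 2], "Comedy": [1, 1, 0], "Drama": [1, 0, 1], "Family": [1, 1, 0]}
)

# Long format: one row per (movie, genre) pair.
long_df = main_df.melt(id_vars="index", var_name="name", value_name="value")

# Keep the pairs where the movie belongs to the genre, then join names per movie.
main_df2 = (
    long_df[long_df["value"] > 0]
    .groupby("index")["name"]
    .agg(",".join)
    .rename("new_genre")
    .reset_index()
)

# Merge back onto the original frame, like left_join in the R version.
result = main_df.merge(main_df2, on="index", how="left")
print(result)
```

A left merge keeps movies that belong to no genre at all; their `new_genre` would be missing rather than an empty string.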
,
Reduced to a single chained expression:
- unstack
- filter
- aggregate
import io
import pandas as pd
df = pd.read_csv(io.StringIO("""Index Biography Comedy Crime Documentary Drama Family Fantasy
0 0 1 0 0 1 1 0
1 0 1 0 0 0 1 0
2 0 0 0 0 1 0 0
3 0 1 0 0 0 0 0
4 0 1 0 0 1 0 0
5 0 0 1 0 1 0 0
6 0 1 0 0 0 0 0"""),sep=r"\s+").set_index("Index")
df.unstack().to_frame().loc[lambda d: d[0].eq(1)].reset_index().groupby("Index").agg({"level_0":" ".join})
Index | level_0 |
---|---|
0 | Comedy Drama Family |
1 | Comedy Family |
2 | Drama |
3 | Comedy |
4 | Comedy Drama |
5 | Crime Drama |
6 | Comedy |