Combine column names based on values to create another column

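I have a movie dataset with a column for each genre, where 1 means the movie belongs to that genre and 0 means it does not. For example: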

Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0

I want a new column that lists the names of the genres each movie belongs to, separated by spaces or commas:

Index  New column
0    Comedy Drama Family
1    Comedy Family
2    Drama
3    Comedy
4    Comedy Drama
5    Crime Drama
6    Comedy

Please share code in R or Python. Thanks for your help.

Solutions

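Matrix multiplication in Python (pandas), with Index set as the DataFrame index so that only the genre columns take part: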

df.dot(df.columns + " ")

which gives

Index
0    Comedy Drama Family
1          Comedy Family
2                  Drama
3                 Comedy
4           Comedy Drama
5            Crime Drama
6                 Comedy

To make it more general:

sep = ","
df.dot(df.columns + sep).str.rstrip(sep)

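That is, append the separator to each column name, perform the matrix-vector multiplication, and then strip the trailing separator from the right-hand end of each result.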
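The following is a minimal, self-contained sketch of why the dot product works here. The data is copied from the question, and it assumes Index is used as the row index. The product relies on Python string repetition: multiplying a string by 1 returns the string and multiplying it by 0 returns an empty string, so summing across a row concatenates only the flagged column names.

```python
import pandas as pd

# Toy data copied from the question; Index is used as the row index so that
# only the 0/1 genre flags take part in the product.
df = pd.DataFrame(
    {
        "Index": [0, 1, 2, 3, 4, 5, 6],
        "Biography": [0, 0, 0, 0, 0, 0, 0],
        "Comedy": [1, 1, 0, 1, 1, 0, 1],
        "Crime": [0, 0, 0, 0, 0, 1, 0],
        "Documentary": [0, 0, 0, 0, 0, 0, 0],
        "Drama": [1, 0, 1, 0, 1, 1, 0],
        "Family": [1, 1, 0, 0, 0, 0, 0],
        "Fantasy": [0, 0, 0, 0, 0, 0, 0],
    }
).set_index("Index")

# 1 * "Comedy," is "Comedy," and 0 * "Comedy," is "", so the dot product
# concatenates the names of the columns flagged with 1 in each row.
sep = ","
genres = df.dot(df.columns + sep).str.rstrip(sep)
print(genres)
```

This only behaves as intended when the flags are exactly 0 or 1; a value of 2 would repeat the genre name twice.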

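library(magrittr)  # provides %>%

# Drop the Index column first so an Index value of 1 is not treated as a genre.
# Note: apply() returns a list here only because rows differ in how many genres they have.
df[-1] %>%
  apply(1,function(x){which(x == 1)}) %>% 
  lapply(function(x){
    paste(names(x),collapse = " ")
    }) %>%
  unlist() -> df$your_new_column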

my.movies <- read.table(text = 'Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0',header = T)
library(tidyverse)
my.movies %>%
  pivot_longer(!Index,names_to = 'genre') %>%
  filter(value !=0) %>%
  group_by(Index) %>%
  summarise(genre = toString(genre))
#> # A tibble: 7 x 2
#>   Index genre                
#>   <int> <chr>                
#> 1     0 Comedy, Drama, Family
#> 2     1 Comedy, Family       
#> 3     2 Drama                
#> 4     3 Comedy               
#> 5     4 Comedy, Drama        
#> 6     5 Crime, Drama         
#> 7     6 Comedy

Created on 2021-05-30 by the reprex package (v2.0.0)

Base R -

# Exclude the Index column (df[-1]) so an Index value of 1 is not treated as a genre.
df$new_col <- apply(df[-1],1,function(x) paste0(names(x)[x == 1],collapse = ' '))

dplyr -

library(dplyr)

df %>%
  group_by(Index) %>%
  summarise(new_col = paste0(names(.[-1])[cur_data() == 1],collapse = ' '))

#  Index new_col            
#  <int> <chr>              
#1     0 Comedy Drama Family
#2     1 Comedy Family      
#3     2 Drama              
#4     3 Comedy             
#5     4 Comedy Drama       
#6     5 Crime Drama        
#7     6 Comedy

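Data

df <- structure(list(Index = 0:6,
  Biography = c(0L, 0L, 0L, 0L, 0L, 0L, 0L),
  Comedy = c(1L, 1L, 0L, 1L, 1L, 0L, 1L),
  Crime = c(0L, 0L, 0L, 0L, 0L, 1L, 0L),
  Documentary = c(0L, 0L, 0L, 0L, 0L, 0L, 0L),
  Drama = c(1L, 0L, 1L, 0L, 1L, 1L, 0L),
  Family = c(1L, 1L, 0L, 0L, 0L, 0L, 0L),
  Fantasy = c(0L, 0L, 0L, 0L, 0L, 0L, 0L)),
  class = "data.frame", row.names = c(NA, -7L))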

Basic Python code:

import pandas as pd
df = pd.read_csv('test.csv')

def check_genre(row):
    # Build a space-separated string of the genres flagged with 1 in this row.
    # Column names are case-sensitive and must match the CSV header exactly.
    s = ""
    if row['Biography'] == 1:
        s = s + ' Biography'
    if row['Comedy'] == 1:
        s = s + ' Comedy'
    if row['Crime'] == 1:
        s = s + ' Crime'
    if row['Documentary'] == 1:
        s = s + ' Documentary'
    if row['Drama'] == 1:
        s = s + ' Drama'
    if row['Family'] == 1:
        s = s + ' Family'
    if row['Fantasy'] == 1:
        s = s + ' Fantasy'

    # Drop the leading space left by the first match.
    return s.strip()

df['genre'] = df.apply(check_genre,axis=1)

print(df)

In pandas, you can take the index labels of the values in each row that equal 1 and join them into a string:

df.apply(lambda row: " ".join(row[row == 1].index),axis=1)

# Index
# 0    Comedy Drama Family
# 1          Comedy Family
# 2                  Drama
# 3                 Comedy
# 4           Comedy Drama
# 5            Crime Drama
# 6                 Comedy
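The snippet above assumes Index is already the DataFrame index. As a hedged sketch, the version below keeps Index as an ordinary column (as pd.read_csv would return it by default) and restricts the row-wise step to the genre columns. The three-row frame and the name genre_cols are illustrative, not part of the original answer.

```python
import pandas as pd

# Small illustrative frame with Index kept as an ordinary column.
df = pd.DataFrame(
    {
        "Index": [0, 1, 2],
        "Comedy": [1, 1, 0],
        "Drama": [1, 0, 1],
        "Family": [1, 1, 0],
    }
)

# Restrict the row-wise step to the genre columns so that an Index value of 1
# is not mistaken for a genre flag.
genre_cols = df.columns.drop("Index")

# For each row, keep the labels whose value is 1 and join them with spaces.
df["genre"] = df[genre_cols].apply(
    lambda row: " ".join(row[row == 1].index), axis=1
)
print(df)
```

Without the column restriction, the row whose Index is 1 would pick up "Index" as if it were a genre.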

Posting an answer in R/dplyr

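Assume main_df is your data frame as shown in the first image. Pivot the data frame longer so that all genre columns are in tidy format. Then group_by the index, since each index value is one movie, and use paste to collapse the genre column.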

main_df%>%
  pivot_longer(cols=-index)%>%
  filter(value>0)%>% # filter where movie is part of the genre i.e 1
  group_by(index)%>%
  mutate(new_genre = paste(name,collapse = ","))%>%
  ungroup()%>%
  distinct(index,new_genre)-> main_df2

# if you want to merge back to the original data frame use left_join

left_join(main_df,main_df2,by="index")
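For readers working in Python, the same reshape, filter, group, and join steps can be sketched in pandas. This is an illustrative analogue rather than part of the R answer: the three-row main_df is made up for the example, and melt, groupby, and merge stand in for pivot_longer, group_by with paste, and left_join.

```python
import pandas as pd

# Illustrative data; the column is named "index" to mirror the R code.
main_df = pd.DataFrame(
    {
        "index": [0, 1, 2],
        "Comedy": [1, 1, 0],
        "Drama": [1, 0, 1],
        "Family": [1, 1, 0],
    }
)

# 1. Pivot longer: one row per (movie, genre) pair.
long_df = main_df.melt(id_vars="index", var_name="name", value_name="value")

# 2. Keep only the pairs where the movie belongs to the genre.
# 3. Collapse the remaining genre names per movie.
main_df2 = (
    long_df[long_df["value"] > 0]
    .groupby("index")["name"]
    .agg(",".join)
    .rename("new_genre")
    .reset_index()
)

# 4. Merge back onto the original frame, like left_join in the R version.
result = main_df.merge(main_df2, on="index", how="left")
print(result)
```

A left merge keeps movies that belong to no genre; their new_genre value would be missing rather than an empty string.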

Reduced to a one-liner:

unstack
filter
aggregate

import io
import pandas as pd

df = pd.read_csv(io.StringIO("""Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0"""),sep=r"\s+").set_index("Index")

df.unstack().to_frame().loc[lambda d: d[0].eq(1)].reset_index().groupby("Index").agg({"level_0":" ".join})

Index  level_0
0      Comedy Drama Family
1      Comedy Family
2      Drama
3      Comedy
4      Comedy Drama
5      Crime Drama
6      Comedy
