Create summary columns based on a list of headers

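I have a dataframe that contains survey data. It has several columns with demographic data (e.g. age, department) as well as columns with ratings. I would like to add some columns to the dataframe based on calculations over the rating columns.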

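The added columns should provide a) the number of favourable responses, b) the percentage of favourable responses (number of favourable responses / number of items in that factor), and c) the factor-level percentage of favourable responses (NaN if any item belonging to the factor is NaN). The table below shows an example of how this applies to the Coaching factor. I would like to replicate it for the other factors, such as Diversity, Leadership and Engagement.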

Coach_q1     Coach_q2     Coach_q8      coach_favcount  coach_fav_perc  coach_agg_perc
Favourable   Neutral      Favourable    2               66.6%           66.6%
Favourable   Favourable   NaN           2               100%            NaN
Favourable   Favourable   Unfavourable  2               66.6%           66.6%
NaN          NaN          Unfavourable  0               0%              NaN
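The three target columns in the table can be sketched for a single factor as follows. This is only an illustration on the sample rows above. It assumes that fav_perc divides by the number of answered items (which is what the 100% in the second row implies) and that agg_perc divides by the total number of items in the factor. The names items, answered and summary are made up for this example.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the table above (Coach factor only).
df = pd.DataFrame({
    'Coach_q1': ['Favourable', 'Favourable', 'Favourable', np.nan],
    'Coach_q2': ['Neutral', 'Favourable', 'Favourable', np.nan],
    'Coach_q8': ['Favourable', np.nan, 'Unfavourable', 'Unfavourable'],
})

items = df[['Coach_q1', 'Coach_q2', 'Coach_q8']]

# a) number of favourable responses per respondent (NaN never equals 'Favourable')
fav_count = (items == 'Favourable').sum(axis=1)

# b) share of favourable responses among the items that were answered (assumption)
answered = items.notna().sum(axis=1)
fav_perc = fav_count / answered

# c) share over all items of the factor, kept only where every item was answered
agg_perc = (fav_count / items.shape[1]).where(items.notna().all(axis=1))

summary = df.assign(coach_fav_count=fav_count,
                    coach_fav_perc=fav_perc,
                    coach_agg_perc=agg_perc)
print(summary)
```

The values are fractions (0.666667 rather than 66.6%). Formatting them as percentages is a separate display step.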

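I used the following code. It works, but it only produces the fav_count and fav_perc columns, and only for Coaching. I would like to a) get the _agg_perc column and b) apply this to all the other factors.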

#Get the Coaching Columns
coaching_agg = df.loc[:,df.columns.str.contains('Coaching_')] 

#Create a column to store the number of favourable responses
df['coaching_fav_count'] = (coaching_agg == 'Favourable').sum(axis=1)

#create a column to store the percentage of favourable responses
df['coaching_fav_perc'] = df['coaching_fav_count'] / len(coaching_agg.columns)

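I guess the logic behind a for loop would be: a) create a list of the rating columns (see the code below), b) create a function that calculates the count and percentage of favourable responses and checks for NaN at the item level, and c) create a for loop that applies the function to the rating columns.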

#Create a list made up of rating cols
ratingcollist = ['Coaching_','Communication_','Development_','diversity_','Engagement_']

ratingcols = df.loc[:,df.columns.str.contains('|'.join(ratingcollist))] 

I would appreciate any help I can get. Thanks!

Solution

I think you need to process each value of the list separately:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Coach_q1': ['Favourable','Favourable','Favourable','nan'],
                   'Coach_q2': ['Neutral','Favourable','Favourable','NaN'],
                   'Coach_q8': ['Favourable','nan','Unfavourable','Unfavourable']})
    
print (df)
     Coach_q1    Coach_q2      Coach_q8
0  Favourable     Neutral    Favourable
1  Favourable  Favourable           nan
2  Favourable  Favourable  Unfavourable
3         nan         NaN  Unfavourable

#replace the 'nan' and 'NaN' strings with real missing values
df = df.replace(['nan','NaN'], np.nan)

ratingcollist = ['Coach_','Communication_','Development_','Diversity_','Engagement_']

for rat in ratingcollist:
    #filter columns by substrings
    cols = df.filter(like=rat).columns

    #mask for no missing values
    mask = df[cols].notna().all(axis=1)
    
    #create new columns if match
    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / df[cols].count(axis=1)
        df.loc[mask,f'{rat.lower()}agg_perc'] = df.loc[mask,f'{rat.lower()}fav_count'] / len(cols)

print (df)

     Coach_q1    Coach_q2      Coach_q8  coach_fav_count  coach_fav_perc  \
0  Favourable     Neutral    Favourable                2        0.666667   
1  Favourable  Favourable           NaN                2        1.000000   
2  Favourable  Favourable  Unfavourable                2        0.666667   
3         NaN         NaN  Unfavourable                0        0.000000   

   coach_agg_perc  
0        0.666667  
1             NaN  
2        0.666667  
3             NaN  
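The question also asks for a function that a loop can apply to every factor. A minimal sketch of that idea follows. The function name add_factor_summary, its parameters and the sample frame are hypothetical. It matches columns by prefix with str.startswith instead of filter(like=...), and it reads the items from the original frame so that summary columns it has just added are never picked up as items.

```python
import numpy as np
import pandas as pd

def add_factor_summary(df, prefixes, favourable='Favourable'):
    """Return a copy of df with <prefix>fav_count, fav_perc and agg_perc columns."""
    out = df.copy()
    for prefix in prefixes:
        # Look up item columns in the original frame, not in `out`.
        cols = [c for c in df.columns if c.startswith(prefix)]
        if not cols:
            continue  # no items for this factor, so nothing to add
        items = df[cols]
        name = prefix.lower()
        fav_count = (items == favourable).sum(axis=1)
        out[f'{name}fav_count'] = fav_count
        # Divide by answered items; a row with no answers gives 0/0 = NaN.
        out[f'{name}fav_perc'] = fav_count / items.notna().sum(axis=1)
        # Factor-level share, kept only where every item was answered.
        out[f'{name}agg_perc'] = (fav_count / len(cols)).where(items.notna().all(axis=1))
    return out

# Made-up sample with two factors; 'Engagement_' has no columns here.
survey = pd.DataFrame({
    'Coach_q1': ['Favourable', 'Favourable', np.nan],
    'Coach_q2': ['Neutral', 'Favourable', 'Unfavourable'],
    'Diversity_q1': ['Favourable', np.nan, 'Favourable'],
    'Diversity_q2': ['Favourable', 'Neutral', 'Favourable'],
})
result = add_factor_summary(survey, ['Coach_', 'Diversity_', 'Engagement_'])
print(result)
```

Returning a copy leaves the input frame unchanged, so the function can be called again with a different prefix list without side effects.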

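If you replace NaN with the word Missing, the fav_perc output is wrong. The second value should be 1, but count only excludes real missing values, so the 'Missing' strings are counted as answers: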

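df = pd.DataFrame({'Coach_q1': ['Favourable','Favourable','Favourable','nan'],
                   'Coach_q2': ['Neutral','Favourable','Favourable','NaN'],
                   'Coach_q8': ['Favourable','nan','Unfavourable','Unfavourable']})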
    
print (df)
     Coach_q1    Coach_q2      Coach_q8
0  Favourable     Neutral    Favourable
1  Favourable  Favourable           nan
2  Favourable  Favourable  Unfavourable
3         nan         NaN  Unfavourable

df = df.replace(['nan','NaN'],'Missing')
print (df)
     Coach_q1    Coach_q2      Coach_q8
0  Favourable     Neutral    Favourable
1  Favourable  Favourable       Missing
2  Favourable  Favourable  Unfavourable
3     Missing     Missing  Unfavourable

#create a list of all the rating columns
ratingcollist = ['Coach_','Diversity','Leadership','Engagement']


#loop over the keywords and select the columns that match each one
for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    #mask for rows with no 'Missing' value in this factor
    mask = (df[cols] != 'Missing').all(axis=1)

    #create 3 new columns for each factor: one for the count of Favourable responses,
    #one for the percentage of Favourable responses, and one for the factor-level percentage
    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / df[cols].count(axis=1)
        df.loc[mask,f'{rat.lower()}agg_perc'] = df.loc[mask,f'{rat.lower()}fav_count'] / len(cols)

print (df)
     Coach_q1    Coach_q2      Coach_q8  coach_fav_count  coach_fav_perc  \
0  Favourable     Neutral    Favourable                2        0.666667   
1  Favourable  Favourable       Missing                2        0.666667   
2  Favourable  Favourable  Unfavourable                2        0.666667   
3     Missing     Missing  Unfavourable                0        0.000000   

   coach_agg_perc  
0        0.666667  
1             NaN  
2        0.666667  
3             NaN  

So, if you need to use Missing, change count to sum and compare for values not equal to Missing:

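#create a list of all the rating columns
ratingcollist = ['Coach_','Diversity','Leadership','Engagement']

for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    mask = (df[cols] != 'Missing').all(axis=1)

    #create 3 new columns for each factor: count, percentage and factor-level percentage
    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / df[cols].ne('Missing').sum(axis=1)
        df.loc[mask,f'{rat.lower()}agg_perc'] = df.loc[mask,f'{rat.lower()}fav_count'] / len(cols)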

print (df)
     Coach_q1    Coach_q2      Coach_q8  coach_fav_count  coach_fav_perc  \
0  Favourable     Neutral    Favourable                2        0.666667   
1  Favourable  Favourable       Missing                2        1.000000   
2  Favourable  Favourable  Unfavourable                2        0.666667   
3     Missing     Missing  Unfavourable                0        0.000000   

   coach_agg_perc  
0        0.666667  
1             NaN  
2        0.666667  
3             NaN  
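The difference between the two denominators can be checked in isolation. The sketch below uses a made-up two-row frame and hypothetical variable names. It compares DataFrame.count, which skips only real missing values, with ne('Missing').sum, which skips the placeholder string.

```python
import numpy as np
import pandas as pd

with_nan = pd.DataFrame({
    'q1': ['Favourable', 'Favourable'],
    'q2': ['Neutral', 'Favourable'],
    'q8': ['Favourable', np.nan],
})
with_text = with_nan.fillna('Missing')

fav_count = with_text.eq('Favourable').sum(axis=1)

# count() skips NaN, so it gives the number of answered items here ...
answered_nan = with_nan.count(axis=1)
# ... but it treats the string 'Missing' as a regular answer.
answered_text_wrong = with_text.count(axis=1)
# Comparing against the placeholder gives the intended denominator.
answered_text_right = with_text.ne('Missing').sum(axis=1)

print(fav_count / answered_text_wrong)   # second row is too low
print(fav_count / answered_text_right)   # second row is 1.0
```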

We can also try it without a loop:

columns_split = df.columns.str.split('_')
count = (df.set_axis(pd.MultiIndex.from_tuples(map(tuple,columns_split)),axis=1)
           .stack()
           .eq('Favourable')
           .groupby(level=0).sum())  # sum(level=0) was removed in pandas 2.0

s = columns_split.str[0].to_series().add('_%Fav')

new_df = (df.join(count.add_suffix('_FavCount'))
           .join(count.add_suffix('_%Fav').div(s.value_counts()))
         )

print(new_df)

Output

  Coaching_q1 Coaching_q2 Diversity_q1 Diversity_q2  Coaching_FavCount  \
0  Favourable     Neutral   Favourable   Favourable                1.0   
1  Favourable  Favourable   Favourable    Favourble                2.0   
2         NaN  Favourable          NaN          NaN                1.0   

   Diversity_FavCount  Coaching_%Fav  Diversity_%Fav  
0                 2.0            0.5             1.0  
1                 1.0            1.0             0.5  
2                 0.0            0.5             0.0  
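The loop-free output above has the count and percentage columns but no factor-level column that turns NaN when an item is unanswered. One possible way to add it without a loop is sketched below. This is an assumption about how the idea could be extended, not part of the original answer. The helper factor_of, the _AggPerc suffix and the sample frame are made up, with 'Neutral' standing in for the misspelled value in the output above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Coaching_q1': ['Favourable', 'Favourable', np.nan],
    'Coaching_q2': ['Neutral', 'Favourable', 'Favourable'],
    'Diversity_q1': ['Favourable', 'Favourable', np.nan],
    'Diversity_q2': ['Favourable', 'Neutral', np.nan],
})

def factor_of(column):
    """Map a column label such as 'Coaching_q1' to its factor name."""
    return column.split('_')[0]

# Transpose so the item columns become rows, group those rows by factor,
# then transpose back to one column per factor.
fav_count = df.eq('Favourable').T.groupby(factor_of).sum().T
complete = df.notna().T.groupby(factor_of).all().T

# Number of items per factor, indexed by factor name.
n_items = df.columns.map(factor_of).value_counts()

# Share over all items of the factor, NaN where any item is unanswered.
agg_perc = fav_count.div(n_items).where(complete)

new_df = df.join(agg_perc.add_suffix('_AggPerc'))
print(new_df)
```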

The problem was solved by recoding the NaN values in the columns as 'Missing' and applying the mask recommended by @jezrael.
