Summing several columns with a column loop in Python

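I have a rather unusual survey data structure, shown in the example below. During the survey we recorded the number of smartphones in each household, and then collected information on how many people use each device for a particular activity.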

Example: the columns are named F3_{smartphone number}_{HH_member_id}, so F3_1_4 means F3 & {first household smartphone}=1 & {Number of Household_member_using/sharing this device = 4}

Or, if 3 household members share one device, F3_1_1, F3_1_2 and F3_1_3 will all be 1.
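To make the naming scheme concrete, here is a minimal sketch that splits a column name into its two numeric parts. The helper name `parse_f3_column` and the regular expression are illustrative assumptions, not part of the original question. Matching is case-insensitive because the description writes `F3_...` while the sample data uses `f3_...`.

```python
import re

# Assumed pattern: "f3_<smartphone number>_<household member id>".
COLUMN_PATTERN = re.compile(r"^f3_(\d+)_(\d+)$", re.IGNORECASE)


def parse_f3_column(name):
    """Return (smartphone_no, member_id) for an F3 column name, else None."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_f3_column("F3_1_4"))  # (1, 4)
print(parse_f3_column("respid"))  # None
```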

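I am trying to take each device separately and count how many phones are used for the activity and by how many people. Here is my attempt: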

df_ph = pd.DataFrame()

   
for h in range(1,5):

  df_shared_ph = pd.DataFrame(None)

  for i in range(1,15):
    
    df_temp_ph = df[["respid","f3_" + str(h) + "_" + str(i)]].copy()
    df_temp_ph.rename(columns = {"f3_" + str(h) + "_" + str(i): "Smartph"},inplace = True)
    df_shared_ph = pd.concat([df_shared_ph,df_temp_ph],axis=0).dropna(subset=["Smartph"])

  df_shared_ph = df_shared_ph.groupby(['respid']).agg({'Smartph': 'sum'}).reset_index()
  df_ph = pd.concat([df_ph,df_shared_ph],axis=0)

  print("used for X and by how many:\n" + str(df_ph['Smartph'].value_counts()))

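My snippet runs fine, but for some reason it duplicates many rows/ids from my original data, and I can't work out why. Am I missing something here? Is there another way to do this?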

df_ph.duplicated(['respid']).sum() == 0
False

Sample data:

# output to a dict
# the dict can be converted to a dataframe with 
# df = pd.DataFrame.from_dict(d,orient='index')  # d is the name of the dict


 {0: {'f3_1_1': 1.0,'f3_1_10': nan,'f3_1_11': nan,'f3_1_12': nan,'f3_1_13': nan,'f3_1_14': nan,'f3_1_15': nan,'f3_1_2': 0.0,'f3_1_3': 0.0,'f3_1_4': 0.0,'f3_1_5': nan,'f3_1_6': nan,'f3_1_7': nan,'f3_1_8': nan,'f3_1_9': nan,'f3_2_1': 0.0,'f3_2_10': nan,'f3_2_11': nan,'f3_2_12': nan,'f3_2_13': nan,'f3_2_14': nan,'f3_2_15': nan,'f3_2_2': 1.0,'f3_2_3': 0.0,'f3_2_4': 0.0,'f3_2_5': nan,'f3_2_6': nan,'f3_2_7': nan,'f3_2_8': nan,'f3_2_9': nan,'f3_3_1': 0.0,'f3_3_10': nan,'f3_3_11': nan,'f3_3_12': nan,'f3_3_13': nan,'f3_3_14': nan,'f3_3_15': nan,'f3_3_2': 0.0,'f3_3_3': 1.0,'f3_3_4': 0.0,'f3_3_5': nan,'f3_3_6': nan,'f3_3_7': nan,'f3_3_8': nan,'f3_3_9': nan,'f3_4_1': 0.0,'f3_4_10': nan,'f3_4_11': nan,'f3_4_12': nan,'f3_4_13': nan,'f3_4_14': nan,'f3_4_15': nan,'f3_4_2': 0.0,'f3_4_3': 0.0,'f3_4_4': 1.0,'f3_4_5': nan,'f3_4_6': nan,'f3_4_7': nan,'f3_4_8': nan,'f3_4_9': nan,'f3_5_1': nan,'f3_5_10': nan,'f3_5_11': nan,'f3_5_12': nan,'f3_5_13': nan,'f3_5_14': nan,'f3_5_15': nan,'f3_5_2': nan,'f3_5_3': nan,'f3_5_4': nan,'f3_5_5': nan,'f3_5_6': nan,'f3_5_7': nan,'f3_5_8': nan,'f3_5_9': nan,'respid': 13766.0},1: {'f3_1_1': nan,'f3_1_2': nan,'f3_1_3': nan,'f3_1_4': nan,'f3_2_1': nan,'f3_2_2': nan,'f3_2_3': nan,'f3_2_4': nan,'f3_3_1': nan,'f3_3_2': nan,'f3_3_3': nan,'f3_3_4': nan,'f3_4_1': nan,'f3_4_2': nan,'f3_4_3': nan,'f3_4_4': nan,'respid': 16346.0},2: {'f3_1_1': 1.0,'respid': 11293.0},3: {'f3_1_1': nan,'respid': 15965.0},4: {'f3_1_1': 1.0,'respid': 7110.0}}
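The dict literal above uses a bare `nan`, so that name has to be defined before the literal can be evaluated. A minimal sketch, assuming NumPy is available and using only the last three records of the sample as an excerpt:

```python
from numpy import nan  # the dict literal uses a bare `nan`
import pandas as pd

# Excerpt of the sample dict: records 2, 3 and 4 only.
d = {
    2: {'f3_1_1': 1.0, 'respid': 11293.0},
    3: {'f3_1_1': nan, 'respid': 15965.0},
    4: {'f3_1_1': 1.0, 'respid': 7110.0},
}

# One row per record, one column per key.
df = pd.DataFrame.from_dict(d, orient='index')
print(df.shape)  # (3, 2)
```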

Solution

It looks like you have encoded a multi-index in the column names. You can decode it as follows.

import pandas as pd

df = pd.DataFrame.from_dict(d,orient='index').set_index("respid")  # d is the name of the dict
# remove redundant "f3_" from column name
df = df.rename(columns={c:c[3:] for c in df.columns if c.startswith("f3_")})

# F3_{smartphone number}_{HH_member_id}
# make columns a multiindex
df.columns = pd.MultiIndex.from_tuples([tuple(c.split("_")) for c in df.columns],names=["smartphone_no","household_id"])
# now it's simple to work with the DataFrame
df.stack()

Output

smartphone_no           1    2    3    4   5
respid  household_id                        
13766.0 1             1.0  0.0  0.0  0.0 NaN
        2             0.0  1.0  0.0  0.0 NaN
        3             0.0  0.0  1.0  0.0 NaN
        4             0.0  0.0  0.0  1.0 NaN
11293.0 1             1.0  0.0  NaN  NaN NaN
        2             0.0  1.0  NaN  NaN NaN
7110.0  1             1.0  0.0  0.0  NaN NaN
        2             0.0  1.0  0.0  NaN NaN
        3             0.0  0.0  1.0  NaN NaN
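From a decoded layout like this, the counts asked about in the question can be derived by summing over members within each (respid, smartphone) pair. The following is a sketch of one alternative route, using `melt` rather than a column MultiIndex. The tiny data set and the names `long`, `parts`, `used` and `users_per_phone` are hypothetical and chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set in the same wide layout as the sample data.
d = {
    0: {"respid": 1, "f3_1_1": 1.0, "f3_1_2": 1.0, "f3_2_1": 0.0, "f3_2_2": 1.0},
    1: {"respid": 2, "f3_1_1": 1.0, "f3_1_2": np.nan, "f3_2_1": np.nan, "f3_2_2": np.nan},
}
df = pd.DataFrame.from_dict(d, orient="index")

# Wide -> long: one row per (respid, column), dropping unanswered cells.
long = df.melt(id_vars="respid", var_name="column", value_name="used")
long = long.dropna(subset=["used"])

# Split "f3_<phone>_<member>" into two integer columns.
parts = long["column"].str.extract(
    r"^f3_(?P<smartphone_no>\d+)_(?P<member_id>\d+)$"
).astype(int)
long = long.join(parts)

# Number of members using each phone, keyed by (respid, smartphone_no).
users_per_phone = long.groupby(["respid", "smartphone_no"])["used"].sum()
print(users_per_phone)

# How many phones are used by 1 person, by 2 people, and so on.
print(users_per_phone.value_counts())
```

In this layout each household appears once per phone, so `respid` alone is not a unique key, while the pair (respid, smartphone_no) is. The loop in the question concatenates one per-phone frame after another along the rows, which is why `df_ph.duplicated(['respid'])` finds repeated ids there.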
