How to compute descriptive statistics for multi-level categorical data in Python
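Below is an example df with three columns, each holding categorical data with several levels. I would like to compute some descriptive statistics across the three columns for each level, for example the number of people in each age group within each location and status, including counts, proportions and standard deviations (I think this should really be a confidence interval here). I'm not sure how to do this in an elegant way. Any suggestions are much appreciated, thank you.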
import random
from datetime import date

import pandas as pd

birth_year = pd.DataFrame([random.randint(1900, 2000) for x in range(50)], columns=['year'])

def age(df, col):
    # Age is approximated as the current year minus the birth year.
    today = date.today()
    age = today.year - df[col]
    bins = [18, 30, 40, 50, 60, 70, 120]
    labs = ['-30', '30-39', '40-49', '50-59', '60-69', '70+']
    # pd.cut returns NaN for ages that fall outside the bin range.
    group = pd.cut(age, bins, labels=labs)
    return group

birth_year.loc[:, 'age_bin'] = age(birth_year, 'year')
def Rand(start, end, num):
    # Return a list of `num` random integers between start and end, inclusive.
    # Defined before its first use (the original defined it after calling it).
    out = []
    for x in range(num):
        out.append(random.randint(start, end))
    return out

location = pd.DataFrame(Rand(1, 6, 50), columns=['location'])

def label_loc(row):
    # Map the numeric location code to a name.
    if row['location'] == 1:
        return 'england'
    if row['location'] == 2:
        return 'ireland'
    if row['location'] == 3:
        return 'scotland'
    if row['location'] == 4:
        return 'wales'
    if row['location'] == 5:
        return 'jersey'
    if row['location'] == 6:
        return 'gurnsey'
    return 'Other'

location = location.apply(lambda row: label_loc(row), axis=1)
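# The arguments to Rand were truncated in the source; codes 1 to 5 and 50 rows are assumed here.
status = pd.DataFrame(Rand(1, 5, 50), columns=['status'])

def label_stat(row):
    # Map the numeric status code to a name; any other code becomes 'Other'.
    if row['status'] == 1:
        return 'married'
    if row['status'] == 2:
        return 'divorced'
    if row['status'] == 3:
        return 'single'
    if row['status'] == 4:
        return 'window'
    return 'Other'

status = status.apply(lambda row: label_stat(row), axis=1)

# Column names follow the zipped values; the original labelled them 'year', 'gender', 'ethnicity'.
df = pd.DataFrame(list(zip(birth_year['age_bin'], status, location)), columns=['age_bin', 'status', 'location'])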
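Solution
(See this gist for a slightly rewritten example setup.)
Let's take an example:
the number of people in each age group in each location and status
If you have a continuous variable, such as year, you can simply tell groupby().agg() which summary statistics you want: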
print(df.groupby(['location','status'])['year'].agg(['mean','std']))
mean std
location status
england Other 1961.000000 16.792856
divorced 1934.666667 30.270998
married 1917.000000 NaN
single 1907.000000 NaN
window 1962.600000 34.011763
ireland Other 1982.000000 NaN
divorced 1949.750000 37.303932
married 1991.000000 NaN
single 1986.500000 2.121320
window 1965.500000 3.535534
jersey Other 1939.800000 26.204961
divorced 1984.000000 NaN
married 1986.000000 NaN
single 1942.500000 54.447222
scotland Other 1942.666667 12.701706
divorced 1946.000000 49.497475
married 1914.000000 NaN
single 1968.000000 NaN
window 1933.500000 24.748737
wales Other 1950.666667 39.526363
divorced 1978.000000 NaN
married 1959.000000 52.325902
single 1929.000000 NaN
window 1990.000000 NaN
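The NaN entries in the std column above are consistent with groups that contain a single row, since the sample standard deviation is undefined for one observation. A minimal sketch that makes this visible by also requesting the count; the `toy` frame is hypothetical data, not the data from the question:

```python
import pandas as pd

# Hypothetical toy data: one group with three rows, one group with a single row.
toy = pd.DataFrame({
    'location': ['england', 'england', 'england', 'wales'],
    'status': ['single', 'single', 'single', 'married'],
    'year': [1950, 1960, 1970, 1980],
})

# 'count' shows the group size next to the mean and the sample standard deviation.
summary = toy.groupby(['location', 'status'])['year'].agg(['count', 'mean', 'std'])
print(summary)
```

Reading the count next to the std helps tell a missing value caused by a one-row group apart from a real data problem.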
For categorical values, you can count them with value_counts(), which adds an extra index level (which you can then unstack):
grouped_age_bin = df.groupby(['location','status'])['age_bin']
counts = grouped_age_bin.value_counts().unstack('age_bin')
print(counts)
age_bin -30 30-39 40-49 50-59 60-69 70+
location status
england Other 0 1 0 1 0 2
divorced 0 0 0 1 0 2
married 0 0 0 0 0 1
single 0 0 0 0 0 1
window 0 1 2 1 0 1
ireland Other 0 1 0 0 0 0
divorced 1 0 0 0 1 2
married 1 0 0 0 0 0
single 0 2 0 0 0 0
window 0 0 0 2 0 0
jersey Other 0 0 1 0 1 3
divorced 0 1 0 0 0 0
married 0 1 0 0 0 0
single 0 1 0 0 0 1
scotland Other 0 0 0 0 0 3
divorced 0 1 0 0 0 1
married 0 0 0 0 0 1
single 0 0 0 1 0 0
window 0 0 0 0 1 1
wales Other 0 1 0 1 0 1
divorced 0 0 1 0 0 0
married 1 0 0 0 0 1
single 0 0 0 0 0 1
window 0 1 0 0 0 0
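As an alternative to value_counts() followed by unstack(), a table of the same shape can be built with pd.crosstab. This is only a sketch on hypothetical toy data with plain string columns; it is not the approach used in the answer above:

```python
import pandas as pd

# Hypothetical toy data with string-valued age bins.
toy = pd.DataFrame({
    'location': ['england', 'england', 'england', 'wales', 'wales'],
    'status': ['single', 'single', 'single', 'married', 'married'],
    'age_bin': ['30-39', '30-39', '70+', '-30', '70+'],
})

# Rows are (location, status) pairs, columns are age bins, cells are counts.
table = pd.crosstab([toy['location'], toy['status']], toy['age_bin'])
print(table)
```

Combinations that never occur get a count of 0, so the table needs no extra fill step.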
If you want the proportion for each category, you can divide by the group size, i.e. grouped_age_bin.size():
print(counts.div(grouped_age_bin.size(),axis='index'))
age_bin -30 30-39 40-49 50-59 60-69 70+
location status
england Other 0.000000 0.0 0.000000 0.000000 0.00 1.000000
married 0.500000 0.0 0.000000 0.000000 0.00 0.500000
single 0.000000 0.0 0.000000 0.000000 0.00 1.000000
window 0.250000 0.0 0.000000 0.000000 0.25 0.500000
ireland Other 0.000000 0.0 0.000000 0.000000 0.00 1.000000
married 0.000000 0.0 0.000000 0.000000 0.00 1.000000
single 0.000000 0.0 0.000000 0.000000 1.00 0.000000
window 0.000000 0.0 0.333333 0.333333 0.00 0.333333
jersey Other 0.000000 0.0 1.000000 0.000000 0.00 0.000000
divorced 0.000000 0.0 1.000000 0.000000 0.00 0.000000
married 0.000000 0.0 0.000000 0.000000 0.00 1.000000
single 0.000000 0.0 0.200000 0.400000 0.20 0.200000
window 0.000000 0.5 0.000000 0.000000 0.00 0.500000
scotland divorced 0.333333 0.0 0.000000 0.000000 0.00 0.666667
married 0.000000 0.0 0.333333 0.333333 0.00 0.333333
single 0.000000 0.5 0.000000 0.000000 0.00 0.500000
window 0.000000 0.0 0.500000 0.000000 0.00 0.500000
wales Other 0.000000 0.5 0.000000 0.000000 0.00 0.500000
divorced 0.000000 0.0 0.000000 0.000000 0.00 1.000000
married 0.500000 0.0 0.000000 0.000000 0.00 0.500000
single 0.000000 0.0 0.000000 0.000000 0.00 1.000000
window 0.500000 0.0 0.500000 0.000000 0.00 0.000000
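The division by group size can be checked on a small deterministic frame, where every row of the result should sum to 1. A minimal sketch, assuming hypothetical toy data with string-valued age bins:

```python
import pandas as pd

# Hypothetical toy data, not the data from the question.
toy = pd.DataFrame({
    'location': ['england', 'england', 'england', 'wales', 'wales'],
    'status': ['single', 'single', 'single', 'married', 'married'],
    'age_bin': ['30-39', '30-39', '70+', '-30', '70+'],
})

grouped = toy.groupby(['location', 'status'])['age_bin']
# Count each age bin per group; bins absent from a group are filled with 0.
counts = grouped.value_counts().unstack(fill_value=0)
# Divide each row by the size of its group to turn counts into proportions.
props = counts.div(grouped.size(), axis='index')
print(props)
```

Because each row is divided by its own group total, the proportions describe the age distribution within a location and status, not across the whole sample.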
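Now that you have the group sizes and the counts, you can compute confidence intervals. Or you can do a simple string aggregation. To have both the group size and the count available, I would use pd.DataFrame.transform + pd.Series.combine, so that you only need to write a single lambda that takes the count in the category and the group total: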
print(counts.transform(pd.Series.combine,'index',grouped_age_bin.size(),lambda num,tot: f'{100 * num / tot:.1f}% (n={num})'))
age_bin -30 30-39 40-49 50-59 60-69 70+
location status
england Other 0.0% (n=0) 0.0% (n=0) 50.0% (n=1) 0.0% (n=0) 0.0% (n=0) 50.0% (n=1)
divorced 0.0% (n=0) 50.0% (n=1) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 50.0% (n=1)
married 33.3% (n=1) 0.0% (n=0) 33.3% (n=1) 0.0% (n=0) 0.0% (n=0) 33.3% (n=1)
single 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 100.0% (n=1)
window 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 100.0% (n=2)
ireland Other 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 100.0% (n=2)
divorced 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 50.0% (n=1) 0.0% (n=0) 50.0% (n=1)
married 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 100.0% (n=2) 0.0% (n=0)
single 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 100.0% (n=1)
window 33.3% (n=1) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 66.7% (n=2)
jersey Other 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 100.0% (n=1) 0.0% (n=0)
married 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 100.0% (n=1)
single 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 100.0% (n=1) 0.0% (n=0) 0.0% (n=0)
scotland Other 0.0% (n=0) 0.0% (n=0) 50.0% (n=1) 0.0% (n=0) 0.0% (n=0) 50.0% (n=1)
divorced 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 100.0% (n=3)
married 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 100.0% (n=2)
single 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 100.0% (n=3)
window 25.0% (n=1) 0.0% (n=0) 0.0% (n=0) 25.0% (n=1) 0.0% (n=0) 50.0% (n=2)
wales Other 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 100.0% (n=1) 0.0% (n=0) 0.0% (n=0)
divorced 16.7% (n=1) 0.0% (n=0) 33.3% (n=2) 0.0% (n=0) 0.0% (n=0) 50.0% (n=3)
married 0.0% (n=0) 33.3% (n=1) 0.0% (n=0) 0.0% (n=0) 0.0% (n=0) 66.7% (n=2)
single 0.0% (n=0) 0.0% (n=0) 33.3% (n=1) 33.3% (n=1) 0.0% (n=0) 33.3% (n=1)
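The confidence interval mentioned above is not computed in the answer. One common choice for a proportion is the Wilson score interval; the sketch below is an assumption about what such a step could look like, and the helper name `wilson_interval` and the default z = 1.96 (roughly a 95% interval) are illustrative choices, not part of the original answer:

```python
import numpy as np

def wilson_interval(k, n, z=1.96):
    """Return (lower, upper) Wilson score bounds for k successes out of n trials."""
    k = np.asarray(k, dtype=float)
    n = np.asarray(n, dtype=float)
    p = k / n                      # observed proportion
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Scalar use: 5 people in a category out of a group of 10.
lower, upper = wilson_interval(5, 10)
print(f'{lower:.3f} to {upper:.3f}')

# Vectorised use: counts for three categories in one group of size 3.
lows, highs = wilson_interval(np.array([0, 2, 1]), 3)
print(lows, highs)
```

With groups as small as the ones in the tables above, the resulting intervals will be very wide, which is worth keeping in mind before reading much into a single percentage.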