These are notes based on SeanCheney's translation on Jianshu; I adjusted the formatting and reorganized the table of contents to make it easier for me to read and look things up later.
import pandas as pd
import numpy as np
Set the maximum number of columns and rows to display
pd.set_option('display.max_columns', 8, 'display.max_rows', 8)
1 Aggregation
Read the flights dataset and inspect the first rows
flights = pd.read_csv('data/flights.csv')
flights.head()
MONTH | DAY | WEEKDAY | AIRLINE | ... | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 4 | WN | ... | 1905 | 65.0 | 0 | 0 |
1 | 1 | 1 | 4 | UA | ... | 1333 | -13.0 | 0 | 0 |
2 | 1 | 1 | 4 | MQ | ... | 1453 | 35.0 | 0 | 0 |
3 | 1 | 1 | 4 | AA | ... | 1935 | -7.0 | 0 | 0 |
4 | 1 | 1 | 4 | WN | ... | 2225 | 39.0 | 0 | 0 |
5 rows × 14 columns
1.1 Single-column aggregation
Group by AIRLINE and call agg, passing a dict that maps the column to aggregate to the aggregation function
flights.groupby('AIRLINE').agg({'ARR_DELAY': 'mean'}).head()
ARR_DELAY | |
---|---|
AIRLINE | |
AA | 5.542661 |
AS | -0.833333 |
B6 | 8.692593 |
DL | 0.339691 |
EV | 7.034580 |
Alternatively, select the column by indexing and pass the aggregation function to agg as a string
flights.groupby('AIRLINE')['ARR_DELAY'].agg('mean').head()
AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64
NumPy's mean function can also be passed to agg
flights.groupby('AIRLINE')['ARR_DELAY'].agg(np.mean).head()
AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64
Or simply call mean() directly
flights.groupby('AIRLINE')['ARR_DELAY'].mean().head()
AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64
1.2 Multi-column aggregation
Total number of cancelled flights for each airline on each weekday
flights.groupby(['AIRLINE', 'WEEKDAY'])['CANCELLED'].agg('sum').head(7)
AIRLINE  WEEKDAY
AA       1          41
         2           9
         3          16
         4          20
         5          18
         6          21
         7          29
Name: CANCELLED, dtype: int64
The grouping columns, the selected columns, and the aggregation functions can all be given as lists
Total number and proportion of cancelled or diverted flights for each airline on each weekday
flights.groupby(['AIRLINE', 'WEEKDAY'])[['CANCELLED', 'DIVERTED']].agg(['sum', 'mean']).head(7)
CANCELLED | DIVERTED | ||||
---|---|---|---|---|---|
sum | mean | sum | mean | ||
AIRLINE | WEEKDAY | ||||
AA | 1 | 41 | 0.032106 | 6 | 0.004699 |
2 | 9 | 0.007341 | 2 | 0.001631 | |
3 | 16 | 0.011949 | 2 | 0.001494 | |
4 | 20 | 0.015004 | 5 | 0.003751 | |
5 | 18 | 0.014151 | 1 | 0.000786 | |
6 | 21 | 0.018667 | 9 | 0.008000 | |
7 | 29 | 0.021837 | 1 | 0.000753 |
Use lists and a nested dict to group and aggregate multiple columns
For each route (origin/destination pair), find the total number of flights, the number and proportion cancelled, and the mean and variance of air time
group_cols = ['ORG_AIR', 'DEST_AIR']
agg_dict = {'CANCELLED': ['sum', 'mean', 'size'], 'AIR_TIME': ['mean', 'var']}
flights.groupby(group_cols).agg(agg_dict).head()
CANCELLED | AIR_TIME | |||||
---|---|---|---|---|---|---|
sum | mean | size | mean | var | ||
ORG_AIR | DEST_AIR | |||||
ATL | ABE | 0 | 0.0 | 31 | 96.387097 | 45.778495 |
ABQ | 0 | 0.0 | 16 | 170.500000 | 87.866667 | |
ABY | 0 | 0.0 | 19 | 28.578947 | 6.590643 | |
ACY | 0 | 0.0 | 6 | 91.333333 | 11.466667 | |
AEX | 0 | 0.0 | 40 | 78.725000 | 47.332692 |
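Since pandas 0.25, the same multi-column aggregation can also be written with named aggregation, which lets you choose flat output column names up front instead of getting a column MultiIndex. A sketch on made-up data:

```python
import pandas as pd

# Made-up miniature of the flights data
df = pd.DataFrame({'ORG_AIR': ['ATL', 'ATL', 'ATL', 'DEN'],
                   'DEST_AIR': ['ABE', 'ABE', 'ABQ', 'LAX'],
                   'CANCELLED': [0, 1, 0, 0],
                   'AIR_TIME': [95.0, None, 170.0, 110.0]})

# Each keyword becomes one output column: name=(input column, agg function)
out = df.groupby(['ORG_AIR', 'DEST_AIR']).agg(
    cancelled_sum=('CANCELLED', 'sum'),
    cancelled_mean=('CANCELLED', 'mean'),
    n_flights=('CANCELLED', 'size'),
    air_time_mean=('AIR_TIME', 'mean'),
)
```

The result has single-level columns (cancelled_sum, cancelled_mean, ...), which avoids the MultiIndex flattening shown in section 3.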
1.3 The DataFrameGroupBy object
The groupby method produces a DataFrameGroupBy object
college = pd.read_csv('data/college.csv')
grouped = college.groupby(['STABBR', 'RELAFFIL'])
Check the type of the grouped object
type(grouped)
pandas.core.groupby.groupby.DataFrameGroupBy
print([attr for attr in dir(grouped) if not attr.startswith('_')])
['CITY', 'CURROPER', 'DISTANCEONLY', 'GRAD_DEBT_MDN_SUPP', 'HBCU', 'INSTNM', 'MD_EARN_WNE_P10', 'MENONLY', 'PCTFLOAN', 'PCTPELL', 'PPTUG_EF', 'RELAFFIL', 'SATMTMID', 'SATVRMID', 'STABBR', 'UG25ABV', 'UGDS', 'UGDS_2MOR', 'UGDS_AIAN', 'UGDS_ASIAN', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_NHPI', 'UGDS_NRA', 'UGDS_UNKN', 'UGDS_WHITE', 'WOMENONLY', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']
grouped.ngroups
112
View the labels that identify each group
The groups attribute is a dict mapping each group label to its row index labels
groups = list(grouped.groups.keys())
groups[:6]
[('AK', 0), ('AK', 1), ('AL', 0), ('AL', 1), ('AR', 0), ('AR', 1)]
get_group retrieves a single group; pass it the group's label tuple
For example, get all religiously affiliated schools in Florida
grouped.get_group(('FL', 1)).head()
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
712 | The Baptist College of Florida | Graceville | FL | 0.0 | ... | 0.5602 | 0.3531 | 30800 | 20052 |
713 | Barry University | Miami | FL | 0.0 | ... | 0.6733 | 0.4361 | 44100 | 28250 |
714 | Gooding Institute of Nurse Anesthesia | Panama City | FL | 0.0 | ... | NaN | NaN | NaN | PrivacySuppressed |
715 | Bethune-Cookman University | Daytona Beach | FL | 1.0 | ... | 0.8867 | 0.0647 | 29400 | 36250 |
724 | Johnson University Florida | Kissimmee | FL | 0.0 | ... | 0.7384 | 0.2185 | 26300 | 20199 |
5 rows × 27 columns
A groupby object is iterable, so each group can be inspected in turn
i = 0
for name, group in grouped:
    print(name)
    display(group.head(2))
    i += 1
    if i == 5:
        break
('AK', 0)
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
60 | University of Alaska Anchorage | Anchorage | AK | 0.0 | ... | 0.2647 | 0.4386 | 42500 | 19449.5 |
62 | University of Alaska Fairbanks | Fairbanks | AK | 0.0 | ... | 0.2550 | 0.4519 | 36200 | 19355 |
2 rows × 27 columns
('AK', 1)
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
61 | Alaska Bible College | Palmer | AK | 0.0 | ... | 0.2857 | 0.4286 | NaN | PrivacySuppressed |
64 | Alaska Pacific University | Anchorage | AK | 0.0 | ... | 0.5297 | 0.4910 | 47000 | 23250 |
2 rows × 27 columns
('AL', 0)
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | Normal | AL | 1.0 | ... | 0.8284 | 0.1049 | 30300 | 33888 |
1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | ... | 0.5214 | 0.2422 | 39700 | 21941.5 |
2 rows × 27 columns
('AL', 1)
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
2 | Amridge University | Montgomery | AL | 0.0 | ... | 0.7795 | 0.8540 | 40100 | 23370 |
10 | Birmingham Southern College | Birmingham | AL | 0.0 | ... | 0.4809 | 0.0152 | 44200 | 27000 |
2 rows × 27 columns
('AR', 0)
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
128 | University of Arkansas at Little Rock | Little Rock | AR | 0.0 | ... | 0.4775 | 0.4062 | 33900 | 21736 |
129 | University of Arkansas for Medical Sciences | Little Rock | AR | 0.0 | ... | 0.6144 | 0.5133 | 61400 | 12500 |
2 rows × 27 columns
Calling head on a groupby object returns the first rows of every group in a single DataFrame
grouped.head(2).head(6)
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | Normal | AL | 1.0 | ... | 0.8284 | 0.1049 | 30300 | 33888 |
1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | ... | 0.5214 | 0.2422 | 39700 | 21941.5 |
2 | Amridge University | Montgomery | AL | 0.0 | ... | 0.7795 | 0.8540 | 40100 | 23370 |
10 | Birmingham Southern College | Birmingham | AL | 0.0 | ... | 0.4809 | 0.0152 | 44200 | 27000 |
43 | Prince Institute-Southeast | Elmhurst | IL | 0.0 | ... | 0.9375 | 0.6569 | PrivacySuppressed | 20992 |
60 | University of Alaska Anchorage | Anchorage | AK | 0.0 | ... | 0.2647 | 0.4386 | 42500 | 19449.5 |
6 rows × 27 columns
The nth method selects rows by their position within each group; here rows 1 and -1, i.e. the second and the last row of each group
grouped.nth([1, -1]).head(8)
CITY | CURROPER | DISTANCEONLY | GRAD_DEBT_MDN_SUPP | ... | UGDS_NRA | UGDS_UNKN | UGDS_WHITE | WOMENONLY | ||
---|---|---|---|---|---|---|---|---|---|---|
STABBR | RELAFFIL | |||||||||
AK | 0 | Fairbanks | 1 | 0.0 | 19355 | ... | 0.0110 | 0.3060 | 0.4259 | 0.0 |
0 | Barrow | 1 | 0.0 | PrivacySuppressed | ... | 0.0183 | 0.0000 | 0.1376 | 0.0 | |
1 | Anchorage | 1 | 0.0 | 23250 | ... | 0.0000 | 0.0873 | 0.5309 | 0.0 | |
1 | Soldotna | 1 | 0.0 | PrivacySuppressed | ... | 0.0000 | 0.1324 | 0.0588 | 0.0 | |
AL | 0 | Birmingham | 1 | 0.0 | 21941.5 | ... | 0.0179 | 0.0100 | 0.5922 | 0.0 |
0 | Dothan | 1 | 0.0 | PrivacySuppressed | ... | NaN | NaN | NaN | 0.0 | |
1 | Birmingham | 1 | 0.0 | 27000 | ... | 0.0000 | 0.0051 | 0.7983 | 0.0 | |
1 | Huntsville | 1 | NaN | 36173.5 | ... | NaN | NaN | NaN | NaN |
8 rows × 25 columns
2 Aggregation functions
college = pd.read_csv('data/college.csv')
college.head()
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | Normal | AL | 1.0 | ... | 0.8284 | 0.1049 | 30300 | 33888 |
1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | ... | 0.5214 | 0.2422 | 39700 | 21941.5 |
2 | Amridge University | Montgomery | AL | 0.0 | ... | 0.7795 | 0.8540 | 40100 | 23370 |
3 | University of Alabama in Huntsville | Huntsville | AL | 0.0 | ... | 0.4596 | 0.2640 | 45500 | 24097 |
4 | Alabama State University | Montgomery | AL | 1.0 | ... | 0.7554 | 0.1270 | 26600 | 33118.5 |
5 rows × 27 columns
2.1 Custom aggregation functions
Find the mean and standard deviation of undergraduate enrollment (UGDS) for each state
college.groupby('STABBR')['UGDS'].agg(['mean', 'std']).round(0).head()
mean | std | |
---|---|---|
STABBR | ||
AK | 2493.0 | 4052.0 |
AL | 2790.0 | 4658.0 |
AR | 1644.0 | 3143.0 |
AS | 1276.0 | NaN |
AZ | 4130.0 | 14894.0 |
Define a custom function that returns the largest absolute z-score within each group
def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()
college.groupby('STABBR')['UGDS'].agg(max_deviation).round(1).head()
STABBR
AK    2.6
AL    5.8
AR    6.3
AS    NaN
AZ    9.9
Name: UGDS, dtype: float64
The custom function works on multiple columns as well
college.groupby('STABBR')[['UGDS', 'SATVRMID', 'SATMTMID']].agg(max_deviation).round(1).head()
UGDS | SATVRMID | SATMTMID | |
---|---|---|---|
STABBR | |||
AK | 2.6 | NaN | NaN |
AL | 5.8 | 1.6 | 1.8 |
AR | 6.3 | 2.2 | 2.3 |
AS | NaN | NaN | NaN |
AZ | 9.9 | 1.9 | 1.4 |
college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATVRMID', 'SATMTMID']].agg([max_deviation, 'mean', 'std']).round(1).head()
UGDS | SATVRMID | SATMTMID | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
max_deviation | mean | std | max_deviation | ... | std | max_deviation | mean | std | ||
STABBR | RELAFFIL | |||||||||
AK | 0 | 2.1 | 3508.9 | 4539.5 | NaN | ... | NaN | NaN | NaN | NaN |
1 | 1.1 | 123.3 | 132.9 | NaN | ... | NaN | NaN | 503.0 | NaN | |
AL | 0 | 5.2 | 3248.8 | 5102.4 | 1.6 | ... | 56.5 | 1.7 | 515.8 | 56.7 |
1 | 2.4 | 979.7 | 870.8 | 1.5 | ... | 53.0 | 1.4 | 485.6 | 61.4 | |
AR | 0 | 5.8 | 1793.7 | 3401.6 | 1.9 | ... | 37.9 | 2.0 | 503.6 | 39.0 |
5 rows × 9 columns
Pandas uses the function's name as the name of the resulting column; you can change it with rename on the result, or through the function's __name__ attribute
max_deviation.__name__
'max_deviation'
max_deviation.__name__ = 'Max Deviation'
college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATVRMID', 'SATMTMID']].agg([max_deviation, 'mean', 'std']).round(1).head()
UGDS | SATVRMID | SATMTMID | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Max Deviation | mean | std | Max Deviation | ... | std | Max Deviation | mean | std | ||
STABBR | RELAFFIL | |||||||||
AK | 0 | 2.1 | 3508.9 | 4539.5 | NaN | ... | NaN | NaN | NaN | NaN |
1 | 1.1 | 123.3 | 132.9 | NaN | ... | NaN | NaN | 503.0 | NaN | |
AL | 0 | 5.2 | 3248.8 | 5102.4 | 1.6 | ... | 56.5 | 1.7 | 515.8 | 56.7 |
1 | 2.4 | 979.7 | 870.8 | 1.5 | ... | 53.0 | 1.4 | 485.6 | 61.4 | |
AR | 0 | 5.8 | 1793.7 | 3401.6 | 1.9 | ... | 37.9 | 2.0 | 503.6 | 39.0 |
5 rows × 9 columns
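Because mutating __name__ changes the label everywhere the function is used afterwards, renaming the result's columns is a less invasive alternative. A sketch with made-up enrollment numbers:

```python
import pandas as pd

def max_deviation(s):
    # Largest absolute z-score within the group
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Made-up data
df = pd.DataFrame({'STABBR': ['AK', 'AK', 'AL', 'AL', 'AL'],
                   'UGDS': [100.0, 300.0, 50.0, 250.0, 600.0]})

res = df.groupby('STABBR')['UGDS'].agg([max_deviation, 'std'])
# Rename only the output column; the function itself is untouched
res = res.rename(columns={'max_deviation': 'Max Deviation'})
```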
2.2 Custom aggregation functions with *args and **kwargs
Define a function that returns the proportion of schools whose undergraduate enrollment is between 1,000 and 3,000
def pct_between_1_3k(s):
    return s.between(1000, 3000).mean()
Group by state and religious affiliation, then aggregate
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between_1_3k).head(9)
STABBR  RELAFFIL
AK      0           0.142857
        1           0.000000
AL      0           0.236111
        1           0.333333
                      ...
AR      1           0.111111
AS      0           1.000000
AZ      0           0.096774
        1           0.000000
Name: UGDS, Length: 9, dtype: float64
A generalized version takes the bounds as parameters; extra positional arguments to agg are forwarded to the function
def pct_between(s, low, high):
    return s.between(low, high).mean()
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, 1000, 10000).head(9)
STABBR  RELAFFIL
AK      0           0.428571
        1           0.000000
AL      0           0.458333
        1           0.375000
                      ...
AR      1           0.166667
AS      0           1.000000
AZ      0           0.233871
        1           0.111111
Name: UGDS, dtype: float64
Explicitly pass the bounds as keyword arguments
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, high=10000, low=1000).head(9)
STABBR  RELAFFIL
AK      0           0.428571
        1           0.000000
AL      0           0.458333
        1           0.375000
                      ...
AR      1           0.166667
AS      0           1.000000
AZ      0           0.233871
        1           0.111111
Name: UGDS, dtype: float64
Keyword and positional arguments can be mixed, as long as the positional arguments come first
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, 1000, high=10000).head(9)
STABBR  RELAFFIL
AK      0           0.428571
        1           0.000000
AL      0           0.458333
        1           0.375000
                      ...
AR      1           0.166667
AS      0           1.000000
AZ      0           0.233871
        1           0.111111
Name: UGDS, dtype: float64
Pandas does not support passing extra arguments when several aggregation functions are applied at once; wrap the function in a closure instead
def make_agg_func(func, name, *args, **kwargs):
    def wrapper(x):
        return func(x, *args, **kwargs)
    wrapper.__name__ = name
    return wrapper
my_agg1 = make_agg_func(pct_between, 'pct_1_3k', low=1000, high=3000)
my_agg2 = make_agg_func(pct_between, 'pct_10_30k', 10000, 30000)
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg([my_agg1, my_agg2])
pct_1_3k | pct_10_30k | ||
---|---|---|---|
STABBR | RELAFFIL | ||
AK | 0 | 0.142857 | 0.142857 |
1 | 0.000000 | 0.000000 | |
AL | 0 | 0.236111 | 0.083333 |
1 | 0.333333 | 0.000000 | |
... | ... | ... | ... |
WI | 1 | 0.360000 | 0.000000 |
WV | 0 | 0.246154 | 0.015385 |
1 | 0.375000 | 0.000000 | |
WY | 0 | 0.545455 | 0.000000 |
112 rows × 2 columns
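The standard library's functools.partial can play the same role as make_agg_func; the only extra step is giving the partial object a __name__ by hand, since pandas uses it for the column label when several functions are passed at once. A sketch on made-up data:

```python
import functools
import pandas as pd

def pct_between(s, low, high):
    return s.between(low, high).mean()

# Made-up data
df = pd.DataFrame({'STABBR': ['AK', 'AK', 'AL', 'AL'],
                   'UGDS': [500.0, 2000.0, 1500.0, 4000.0]})

# Freeze the bounds with partial instead of a hand-written closure
pct_1_3k = functools.partial(pct_between, low=1000, high=3000)
pct_1_3k.__name__ = 'pct_1_3k'  # partials have no __name__ by default

res = df.groupby('STABBR')['UGDS'].agg(pct_1_3k)
```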
3 Removing the MultiIndex after aggregation
Read the data
flights = pd.read_csv('data/flights.csv')
flights.head()
MONTH | DAY | WEEKDAY | AIRLINE | ... | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 4 | WN | ... | 1905 | 65.0 | 0 | 0 |
1 | 1 | 1 | 4 | UA | ... | 1333 | -13.0 | 0 | 0 |
2 | 1 | 1 | 4 | MQ | ... | 1453 | 35.0 | 0 | 0 |
3 | 1 | 1 | 4 | AA | ... | 1935 | -7.0 | 0 | 0 |
4 | 1 | 1 | 4 | WN | ... | 2225 | 39.0 | 0 | 0 |
5 rows × 14 columns
Group by AIRLINE and WEEKDAY, and aggregate the DIST and ARR_DELAY columns
airline_info = (flights.groupby(['AIRLINE', 'WEEKDAY'])
                       .agg({'DIST': ['sum', 'mean'], 'ARR_DELAY': ['min', 'max']})
                       .astype(int))
airline_info.head()
DIST | ARR_DELAY | ||||
---|---|---|---|---|---|
sum | mean | min | max | ||
AIRLINE | WEEKDAY | ||||
AA | 1 | 1455386 | 1139 | -60 | 551 |
2 | 1358256 | 1107 | -52 | 725 | |
3 | 1496665 | 1117 | -45 | 473 | |
4 | 1452394 | 1089 | -46 | 349 | |
5 | 1427749 | 1122 | -41 | 732 |
Both the rows and the columns now have two index levels
3.1 Concatenating the column index levels
get_level_values(0) extracts the first level of the column index
level0 = airline_info.columns.get_level_values(0)
get_level_values(1) extracts the second level
level1 = airline_info.columns.get_level_values(1)
Concatenate the two levels into a new flat column index
airline_info.columns = level0 + '_' + level1
airline_info.head(7)
DIST_sum | DIST_mean | ARR_DELAY_min | ARR_DELAY_max | ||
---|---|---|---|---|---|
AIRLINE | WEEKDAY | ||||
AA | 1 | 1455386 | 1139 | -60 | 551 |
2 | 1358256 | 1107 | -52 | 725 | |
3 | 1496665 | 1117 | -45 | 473 | |
4 | 1452394 | 1089 | -46 | 349 | |
5 | 1427749 | 1122 | -41 | 732 | |
6 | 1265340 | 1124 | -50 | 858 | |
7 | 1461906 | 1100 | -49 | 626 |
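The two get_level_values calls can be collapsed into a single list comprehension over the column tuples (or, in pandas 0.24 and later, columns.to_flat_index()). A sketch on made-up data:

```python
import pandas as pd

# Made-up miniature of the flights data
df = pd.DataFrame({'AIRLINE': ['AA', 'AA', 'UA'],
                   'DIST': [1000, 1200, 800],
                   'ARR_DELAY': [-5, 30, 10]})

agged = df.groupby('AIRLINE').agg({'DIST': ['sum', 'mean'],
                                   'ARR_DELAY': ['min', 'max']})

# Each column label is a (level0, level1) tuple; join the parts with '_'
agged.columns = ['_'.join(pair) for pair in agged.columns]
```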
3.2 Resetting the row index
reset_index() converts the row MultiIndex into ordinary columns
airline_info.reset_index().head(7)
AIRLINE | WEEKDAY | DIST_sum | DIST_mean | ARR_DELAY_min | ARR_DELAY_max | |
---|---|---|---|---|---|---|
0 | AA | 1 | 1455386 | 1139 | -60 | 551 |
1 | AA | 2 | 1358256 | 1107 | -52 | 725 |
2 | AA | 3 | 1496665 | 1117 | -45 | 473 |
3 | AA | 4 | 1452394 | 1089 | -46 | 349 |
4 | AA | 5 | 1427749 | 1122 | -41 | 732 |
5 | AA | 6 | 1265340 | 1124 | -50 | 858 |
6 | AA | 7 | 1461906 | 1100 | -49 | 626 |
By default, pandas places the grouping columns in the index after a groupby operation; setting as_index=False avoids this.
Calling reset_index after grouping achieves the same effect
flights.groupby(['AIRLINE'], as_index=False)['DIST'].agg('mean').round(0)
AIRLINE | DIST | |
---|---|---|
0 | AA | 1114.0 |
1 | AS | 1066.0 |
2 | B6 | 1772.0 |
3 | DL | 866.0 |
... | ... | ... |
10 | UA | 1231.0 |
11 | US | 1181.0 |
12 | VX | 1240.0 |
13 | WN | 810.0 |
14 rows × 2 columns
4 Filtering groups
college = pd.read_csv('data/college.csv', index_col='INSTNM')
grouped = college.groupby('STABBR')
grouped.ngroups
59
This equals the number of distinct states; nunique() gives the same result
college['STABBR'].nunique()
59
Define a function that computes a group's overall minority-student proportion and returns True when it exceeds a threshold
def check_minority(df, threshold):
    minority_pct = 1 - df['UGDS_WHITE']
    total_minority = (df['UGDS'] * minority_pct).sum()
    total_ugds = df['UGDS'].sum()
    total_minority_pct = total_minority / total_ugds
    return total_minority_pct > threshold
The grouped object's filter method takes a custom function that decides whether each group is kept
college_filtered = grouped.filter(check_minority, threshold=.5)
college_filtered.head()
CITY | STABBR | HBCU | MENONLY | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
INSTNM | |||||||||
Everest College-Phoenix | Phoenix | AZ | 0.0 | 0.0 | ... | 0.7151 | 0.6700 | 28600 | 9500 |
Collins College | Phoenix | AZ | 0.0 | 0.0 | ... | 0.8228 | 0.4764 | 25700 | 47000 |
Empire Beauty School-Paradise Valley | Phoenix | AZ | 0.0 | 0.0 | ... | 0.5873 | 0.4651 | 17800 | 9588 |
Empire Beauty School-Tucson | Tucson | AZ | 0.0 | 0.0 | ... | 0.6615 | 0.4229 | 18200 | 9833 |
Thunderbird School of Global Management | Glendale | AZ | 0.0 | 0.0 | ... | 0.0000 | 0.0000 | 118900 | PrivacySuppressed |
5 rows × 26 columns
Comparing the shapes shows that about 60% of the rows were filtered out; minority students form the majority in only 20 states
college.shape
(7535,26)
college_filtered.shape
(3028,26)
college_filtered[‘STABBR‘].nunique()
20
Try a few different thresholds, checking the shape and the number of states each time
college_filtered_20 = grouped.filter(check_minority, threshold=.2)
college_filtered_20.shape, college_filtered_20['STABBR'].nunique()
((7461, 26), 57)
college_filtered_70 = grouped.filter(check_minority, threshold=.7)
college_filtered_70.shape, college_filtered_70['STABBR'].nunique()
((957, 26), 10)
college_filtered_95 = grouped.filter(check_minority, threshold=.95)
college_filtered_95.shape, college_filtered_95['STABBR'].nunique()
((156, 26), 7)
5 The apply function
Read college; drop any row with a missing value in the UGDS, SATMTMID, or SATVRMID columns
college = pd.read_csv('data/college.csv')
subset = ['UGDS', 'SATMTMID', 'SATVRMID']
college2 = college.dropna(subset=subset)
college.shape, college2.shape
((7535, 27), (1184, 27))
5.1 apply vs. agg
def weighted_math_average(df):
    weighted_math = df['UGDS'] * df['SATMTMID']
    return int(weighted_math.sum() / df['UGDS'].sum())
5.1.1 Applying the aggregation function with apply
college2.groupby('STABBR').apply(weighted_math_average).head()
STABBR
AK    503
AL    536
AR    529
AZ    569
CA    564
dtype: int64
5.1.2 Applying the same function with agg
college2.groupby('STABBR').agg(weighted_math_average).head()
INSTNM | CITY | HBCU | MENONLY | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
STABBR | |||||||||
AK | 503 | 503 | 503 | 503 | ... | 503 | 503 | 503 | 503 |
AL | 536 | 536 | 536 | 536 | ... | 536 | 536 | 536 | 536 |
AR | 529 | 529 | 529 | 529 | ... | 529 | 529 | 529 | 529 |
AZ | 569 | 569 | 569 | 569 | ... | 569 | 569 | 569 | 569 |
CA | 564 | 564 | 564 | 564 | ... | 564 | 564 | 564 | 564 |
5 rows × 26 columns
Restricting the selection to SATMTMID raises an error, because the function can no longer access UGDS.
# college2.groupby('STABBR')['SATMTMID'].agg(weighted_math_average)
5.2 Creating new columns with apply
A nice feature of apply is that, by returning a Series, it can create several new columns at once
from collections import OrderedDict

def weighted_average(df):
    data = OrderedDict()
    weight_m = df['UGDS'] * df['SATMTMID']
    weight_v = df['UGDS'] * df['SATVRMID']
    data['weighted_math_avg'] = weight_m.sum() / df['UGDS'].sum()
    data['weighted_verbal_avg'] = weight_v.sum() / df['UGDS'].sum()
    data['math_avg'] = df['SATMTMID'].mean()
    data['verbal_avg'] = df['SATVRMID'].mean()
    data['count'] = len(df)
    return pd.Series(data, dtype='int')

college2.groupby('STABBR').apply(weighted_average).head(10)
weighted_math_avg | weighted_verbal_avg | math_avg | verbal_avg | count | |
---|---|---|---|---|---|
STABBR | |||||
AK | 503 | 555 | 503 | 555 | 1 |
AL | 536 | 533 | 504 | 508 | 21 |
AR | 529 | 504 | 515 | 491 | 16 |
AZ | 569 | 557 | 536 | 538 | 6 |
... | ... | ... | ... | ... | ... |
CT | 545 | 533 | 522 | 517 | 14 |
DC | 621 | 623 | 588 | 589 | 6 |
DE | 569 | 553 | 495 | 486 | 3 |
FL | 565 | 565 | 521 | 529 | 38 |
10 rows × 5 columns
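The essence of the pattern: a function that returns a Series yields one output column per Series entry, which plain agg cannot do when the computation mixes several input columns. A compressed sketch on made-up data:

```python
import pandas as pd

# Made-up data
df = pd.DataFrame({'STABBR': ['AK', 'AL', 'AL'],
                   'UGDS': [100.0, 300.0, 100.0],
                   'SATMTMID': [500.0, 550.0, 430.0]})

def summarize(g):
    # Enrollment-weighted math score plus the group size, in one pass
    w = (g['UGDS'] * g['SATMTMID']).sum() / g['UGDS'].sum()
    return pd.Series({'weighted_math_avg': w, 'count': len(g)})

out = df.groupby('STABBR').apply(summarize)
```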
5.3 Creating a DataFrame with apply
Define a custom function that returns a DataFrame
Use NumPy's average to compute the weighted mean, and SciPy's gmean and hmean for the geometric and harmonic means
from scipy.stats import gmean, hmean

def calculate_means(df):
    df_means = pd.DataFrame(index=['Arithmetic', 'Weighted', 'Geometric', 'Harmonic'])
    cols = ['SATMTMID', 'SATVRMID']
    for col in cols:
        arithmetic = df[col].mean()
        weighted = np.average(df[col], weights=df['UGDS'])
        geometric = gmean(df[col])
        harmonic = hmean(df[col])
        df_means[col] = [arithmetic, weighted, geometric, harmonic]
    df_means['count'] = len(df)
    return df_means.astype(int)
(college2.groupby('STABBR')
         .filter(lambda x: len(x) != 1)
         .groupby('STABBR')
         .apply(calculate_means)
         .head(10))
SATMTMID | SATVRMID | count | ||
---|---|---|---|---|
STABBR | ||||
AL | Arithmetic | 504 | 508 | 21 |
Weighted | 536 | 533 | 21 | |
Geometric | 500 | 505 | 21 | |
Harmonic | 497 | 502 | 21 | |
... | ... | ... | ... | ... |
AR | Geometric | 514 | 489 | 16 |
Harmonic | 513 | 487 | 16 | |
AZ | Arithmetic | 536 | 538 | 6 |
Weighted | 569 | 557 | 6 |
10 rows × 3 columns