Sorting/selecting unique and most recent data

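I am trying to get the most relevant values from my data. I figured out how to use the max and min functions in Pandas to get the oldest and most recent dates, but I can't work out the rest. I want to take each unique company and product pair from my dataset and derive the remaining columns according to the rules below. If someone could point me to the Python tools for this kind of problem, or give guidance on how to approach it in Python, that would be very helpful.

For security_level: supersevere > severe > moderate > material > minor
For rating: if the same company and product has both True and False, the result is True
For rating_level: critical > high > medium > low
For first_release, take the earliest date; for last_release, take the most recent date
For score, take the highest score within the same company and product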

company	product	security_level	rating	rating_level	first_release	last_release	score
google	mobile	minor	True	critical	04/11/2020	03/17/2020	0.5
google	os	moderate	False	medium	09/05/2019	03/20/2021	0.009
google	os	minor	False	low	09/04/2019	05/11/2020	19
google	tv	severe	True	high	08/11/2020	03/04/2021
google	mobile	supersevere	False	medium	04/06/2015	08/19/2020	2.4
google	mobile	minor	False	high	08/08/2019	08/19/2020	1.3
apple	iphone	minor	True	low	02/03/2020	10/13/2020	3
apple	iphone	material	True	medium	01/21/2018	03/04/2021	6
apple	iwatch	material	False	low	04/11/2015	08/13/2020	8
apple	iphone	material	True	medium	10/20/2020	03/19/2021	5
dell	laptop	minor	False	low	01/05/2021	03/20/2021	1

Expected output:

company	product	security_level	rating	rating_level	first_release	last_release	score
google	mobile	supersevere	True	critical	04/06/2015	08/19/2020	2.4
google	os	moderate	False	medium	09/04/2019	03/20/2021	19
google	tv	severe	True	high	08/11/2020	03/04/2021
apple	iphone	material	True	medium	01/21/2018	03/19/2021	6
apple	iwatch	material	False	low	04/11/2015	08/13/2020	8
dell	laptop	minor	False	low	01/05/2021	03/20/2021	1

Solution

Change the dtype of the first_release and last_release columns to datetime

import pandas as pd
df['last_release']  = pd.to_datetime(df['last_release'])
df['first_release'] = pd.to_datetime(df['first_release'])
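Since the dates in the question are written as MM/DD/YYYY, it may be safer to pass the format explicitly rather than rely on inference. A minimal sketch, using three date strings copied from the question's table (the variable names are only illustrative):

```python
import pandas as pd

# Three first_release values copied from the question's table (MM/DD/YYYY).
dates = pd.Series(["04/11/2020", "09/05/2019", "01/21/2018"])

# An explicit format removes any month-first vs. day-first ambiguity.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

print(parsed.min())  # earliest date in the series
print(parsed.max())  # most recent date in the series
```

Once the columns are real datetimes, min and max compare them chronologically instead of as strings.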

Convert the security_level and rating_level columns to ordered categorical types, listing the categories from lowest to highest

df['rating_level'] = pd.Categorical(df['rating_level'],['low','medium','high','critical'],ordered=True)
df['security_level'] = pd.Categorical(df['security_level'],['minor','material','moderate','severe','supersevere'],ordered=True)
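To see why the ordered categorical matters, here is a small sketch of how such a column behaves. The category list is the one used above; the sample values and variable names are only for illustration:

```python
import pandas as pd

# Categories listed from lowest to highest severity, as in the answer.
levels = ["minor", "material", "moderate", "severe", "supersevere"]

s = pd.Series(pd.Categorical(["minor", "supersevere", "moderate"],
                             categories=levels, ordered=True))

highest = s.max()                            # follows category order, not alphabet
lowest = s.min()
above_material = (s > "material").tolist()   # element-wise comparison by rank

# A value that is not in the category list (for example a misspelling)
# silently becomes NaN, so it is worth checking for missing values.
typo = pd.Categorical(["superservere"], categories=levels, ordered=True)
has_missing = bool(pd.isna(typo).any())

print(highest, lowest, above_material, has_missing)
```

Without ordered=True, max and min on a categorical column raise an error, and on plain strings they would compare alphabetically.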

Group the dataframe on the company and product columns, and aggregate the remaining columns using the aggregation function specified for each one in agg_dict

agg_dict = {'security_level': 'max', 'rating': 'max', 'rating_level': 'max', 'first_release': 'min', 'last_release': 'max', 'score': 'max'}

out = df.groupby(['company', 'product'], as_index=False, sort=False).agg(agg_dict)

Result

>>> out

  company product security_level  rating rating_level first_release last_release  score
0  google  mobile    supersevere    True     critical    2015-04-06   2020-08-19    2.4
1  google      os       moderate   False       medium    2019-09-04   2021-03-20   19.0
2  google      tv         severe    True         high    2020-08-11   2021-03-04    NaN
3   apple  iphone       material    True       medium    2018-01-21   2021-03-19    6.0
4   apple  iwatch       material   False          low    2015-04-11   2020-08-13    8.0
5    dell  laptop          minor   False          low    2021-01-05   2021-03-20    1.0
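For anyone who wants to check the approach quickly, the following is a self-contained sketch that runs the same three steps on four rows copied from the question's table (only the google mobile and os rows, to keep it short). The explicit date format is my own addition, not part of the answer above:

```python
import pandas as pd

# Four rows copied from the question's table (google mobile / os only).
df = pd.DataFrame({
    "company": ["google", "google", "google", "google"],
    "product": ["mobile", "os", "mobile", "os"],
    "security_level": ["minor", "moderate", "supersevere", "minor"],
    "rating": [True, False, False, False],
    "rating_level": ["critical", "medium", "medium", "low"],
    "first_release": ["04/11/2020", "09/05/2019", "04/06/2015", "09/04/2019"],
    "last_release": ["03/17/2020", "03/20/2021", "08/19/2020", "05/11/2020"],
    "score": [0.5, 0.009, 2.4, 19.0],
})

# Step 1: real datetimes (explicit format is an assumption about the input).
for col in ["first_release", "last_release"]:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")

# Step 2: ordered categoricals, lowest to highest.
df["rating_level"] = pd.Categorical(
    df["rating_level"], ["low", "medium", "high", "critical"], ordered=True)
df["security_level"] = pd.Categorical(
    df["security_level"],
    ["minor", "material", "moderate", "severe", "supersevere"], ordered=True)

# Step 3: one aggregation per column; max on a bool column is True if any is True.
agg_dict = {"security_level": "max", "rating": "max", "rating_level": "max",
            "first_release": "min", "last_release": "max", "score": "max"}
out = df.groupby(["company", "product"], as_index=False, sort=False).agg(agg_dict)

print(out)
```

Each column is aggregated independently, so the resulting row for a group can combine values that came from different input rows, which is what the rules in the question ask for.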

As your question states: sort, then select the first row of each group. You have already defined the sort orders by category.

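import numpy as np
import pandas as pd

nan = np.nan
# data rebuilt from the table in the question
df = pd.DataFrame({
    'company': ['google', 'google', 'google', 'google', 'google', 'google', 'apple', 'apple', 'apple', 'apple', 'dell'],
    'product': ['mobile', 'os', 'os', 'tv', 'mobile', 'mobile', 'iphone', 'iphone', 'iwatch', 'iphone', 'laptop'],
    'security_level': ['minor', 'moderate', 'minor', 'severe', 'supersevere', 'minor', 'minor', 'material', 'material', 'material', 'minor'],
    'rating': [True, False, False, True, False, False, True, True, False, True, False],
    'rating_level': ['critical', 'medium', 'low', 'high', 'medium', 'high', 'low', 'medium', 'low', 'medium', 'low'],
    'first_release': ['04/11/2020', '09/05/2019', '09/04/2019', '08/11/2020', '04/06/2015', '08/08/2019', '02/03/2020', '01/21/2018', '04/11/2015', '10/20/2020', '01/05/2021'],
    'last_release': ['03/17/2020', '03/20/2021', '05/11/2020', '03/04/2021', '08/19/2020', '08/19/2020', '10/13/2020', '03/04/2021', '08/13/2020', '03/19/2021', '03/20/2021'],
    'score': [0.5, 0.009, 19.0, nan, 2.4, 1.3, 3.0, 6.0, 8.0, 5.0, 1.0],
})

# fix data types of columns; categoricals define the sort orders (highest level first)
df['first_release'] = pd.to_datetime(df['first_release'])
df['last_release'] = pd.to_datetime(df['last_release'])
df['security_level'] = pd.Categorical(df['security_level'], ['supersevere', 'severe', 'moderate', 'material', 'minor'], ordered=True)
df['rating_level'] = pd.Categorical(df['rating_level'], ['critical', 'high', 'medium', 'low'], ordered=True)

# one ascending flag per sort key; the flags are reconstructed from the sorted output
# shown below. The last flag (last_release) is an assumption and has no effect on this
# data, because no two rows tie on the first six keys.
dfs = df.sort_values(['company', 'product', 'security_level', 'rating', 'rating_level', 'first_release', 'last_release'],
                     ascending=[True, True, True, False, True, False, True])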

After sorting with all the rules

	company	product	security_level	rating	rating_level	first_release	last_release	score
9	apple	iphone	material	True	medium	2020-10-20 00:00:00	2021-03-19 00:00:00	5
7	apple	iphone	material	True	medium	2018-01-21 00:00:00	2021-03-04 00:00:00	6
6	apple	iphone	minor	True	low	2020-02-03 00:00:00	2020-10-13 00:00:00	3
8	apple	iwatch	material	False	low	2015-04-11 00:00:00	2020-08-13 00:00:00	8
10	dell	laptop	minor	False	low	2021-01-05 00:00:00	2021-03-20 00:00:00	1
4	google	mobile	supersevere	False	medium	2015-04-06 00:00:00	2020-08-19 00:00:00	2.4
0	google	mobile	minor	True	critical	2020-04-11 00:00:00	2020-03-17 00:00:00	0.5
5	google	mobile	minor	False	high	2019-08-08 00:00:00	2020-08-19 00:00:00	1.3
1	google	os	moderate	False	medium	2019-09-05 00:00:00	2021-03-20 00:00:00	0.009
2	google	os	minor	False	low	2019-09-04 00:00:00	2020-05-11 00:00:00	19
3	google	tv	severe	True	high	2020-08-11 00:00:00	2021-03-04 00:00:00	nan

Finally, just take `first()` of each group

dfs.groupby(["company","product"],as_index=False).first()

	company	product	security_level	rating	rating_level	first_release	last_release	score
0	apple	iphone	material	True	medium	2020-10-20 00:00:00	2021-03-19 00:00:00	5
1	apple	iwatch	material	False	low	2015-04-11 00:00:00	2020-08-13 00:00:00	8
2	dell	laptop	minor	False	low	2021-01-05 00:00:00	2021-03-20 00:00:00	1
3	google	mobile	supersevere	False	medium	2015-04-06 00:00:00	2020-08-19 00:00:00	2.4
4	google	os	moderate	False	medium	2019-09-05 00:00:00	2021-03-20 00:00:00	0.009
5	google	tv	severe	True	high	2020-08-11 00:00:00	2021-03-04 00:00:00	nan
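One detail worth knowing when using this pattern: GroupBy.first() returns the first non-null value of each column within a group, which is not always the same thing as the first row. It makes no difference for the data above, but the sketch below shows the distinction using a hypothetical second tv row that is not in the question's data:

```python
import numpy as np
import pandas as pd

# The second row is hypothetical, added only to show the NaN behaviour.
df = pd.DataFrame({
    "company": ["google", "google"],
    "product": ["tv", "tv"],
    "score": [np.nan, 7.0],
})

keys = ["company", "product"]

# first(): first non-null value per column, so the NaN is skipped.
first_non_null = df.groupby(keys, as_index=False).first()

# head(1) and drop_duplicates(): keep the actual first row, NaN included.
first_row = df.groupby(keys).head(1)
same_row = df.drop_duplicates(keys)  # keep='first' is the default

print(first_non_null)
print(first_row)
```

If the intent is strictly "the top row after sorting", head(1) or drop_duplicates on the sorted frame may be the closer match.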
