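Counting and grouping by month-and-year combinations using dplyr and lubridate
I have a data frame in which each row represents one event that occurred in a city. It contains the city name and the date of the event, like this: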
# 13 events: 5 in Seattle, 3 in NYC, 5 in Chicago (one city per date)
df <- data.frame(city = rep(c("Seattle","NYC","Chicago"), times = c(5,3,5)),date_of_event = c("01/13/2011","01/17/2011","03/15/2011","05/21/2011","05/23/2011","01/20/2011","01/22/2011","03/23/2011","01/18/2011","02/24/2011","02/26/2011","04/30/2011","06/18/2011"),stringsAsFactors = FALSE)
df$date_of_event <- as.Date(df$date_of_event,"%m/%d/%Y")
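The above is just an example; my real data is in a CSV with thousands of rows, many cities, many dates, and so on. What I want is a new data frame with one row for each city and each month/year combination that appears in the data set, plus a count column showing how many events occurred in that city in that month in the original data frame. The second data frame would look like this: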
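# One row per city and month that actually occurs in df, with its event count
df2 <- data.frame(city = c("Seattle","Seattle","Seattle","NYC","NYC","Chicago","Chicago","Chicago","Chicago"),month_year = c("01/01/2011","03/01/2011","05/01/2011","01/01/2011","03/01/2011","01/01/2011","02/01/2011","04/01/2011","06/01/2011"),count = c(2,1,2,2,1,1,2,1,1),stringsAsFactors = FALSE)
df2$month_year <- as.Date(df2$month_year,"%m/%d/%Y")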
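I know that dplyr has count and that lubridate can round a date down to the first day of its month, but I have tried and failed to get the grouping and counting right to produce the second data frame I want.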
Solution
You can try this:
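library(tidyverse)
library(lubridate)
# Round each date down to the first day of its month
df3 <- df %>% mutate(new_date = floor_date(date_of_event,"month"))
# Drop the original date column, then cross-tabulate city by month
tt <- as.data.frame(table(df3[-2]))
# Sort by city (descending), then by month
tt[order(desc(tt$city),tt$new_date),]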
city new_date Freq
Seattle 2011-01-01 2
Seattle 2011-02-01 0
Seattle 2011-03-01 1
Seattle 2011-04-01 0
Seattle 2011-05-01 2
Seattle 2011-06-01 0
NYC 2011-01-01 2
NYC 2011-02-01 0
NYC 2011-03-01 1
NYC 2011-04-01 0
NYC 2011-05-01 0
NYC 2011-06-01 0
Chicago 2011-01-01 1
Chicago 2011-02-01 2
Chicago 2011-03-01 0
Chicago 2011-04-01 1
Chicago 2011-05-01 0
Chicago 2011-06-01 1
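Note that table() produces a row for every city and month combination, including those with a count of zero. If only the combinations that actually occur are wanted, as in the df2 from the question, a minimal sketch with dplyr's count() might look like the following. The sample data is rebuilt here so the snippet runs on its own; the names monthly_counts, month_year and count are chosen for illustration only.

```r
library(dplyr)
library(lubridate)

# Sample events from the question: 5 in Seattle, 3 in NYC, 5 in Chicago
df <- data.frame(
  city = rep(c("Seattle", "NYC", "Chicago"), times = c(5, 3, 5)),
  date_of_event = as.Date(
    c("01/13/2011", "01/17/2011", "03/15/2011", "05/21/2011", "05/23/2011",
      "01/20/2011", "01/22/2011", "03/23/2011",
      "01/18/2011", "02/24/2011", "02/26/2011", "04/30/2011", "06/18/2011"),
    "%m/%d/%Y"
  ),
  stringsAsFactors = FALSE
)

# Round each date down to the first of its month, then count rows
# per city and month; only combinations present in the data appear
monthly_counts <- df %>%
  mutate(month_year = floor_date(date_of_event, "month")) %>%
  count(city, month_year, name = "count")

monthly_counts
```

The result has one row per observed city and month, sorted by city and then by month, which is the shape the question asks for.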
To include zero counts over a longer period, you can try the following:
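# assign a name to the output obtained previously
df4 <- tt[order(desc(tt$city),tt$new_date),]
a <- mdy("01/01/11") # starting period
b <- a + months(0:92) # period sequence
# every combination of city and month in the extended period
df5 <- expand.grid(city = c("Chicago","Seattle","NYC"),new_date = as.factor(b))
# keep only the combinations that are not already in df4
df6 <- setdiff(df5,df4[-3])
df6$Freq <- 0 # assign zero count
# stack the counted and the zero-count rows, then sort by city and month
df7 <- rbind(df4,df6)
df8 <- df7[order(df7$city,df7$new_date),]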
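As an alternative to building the grid by hand, the same zero-filling can probably be done with complete() from tidyr. This is only a sketch under assumptions: the 93-month range mirrors months(0:92) above, and the names all_months and filled are illustrative. The sample data is rebuilt so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Sample events from the question: 5 in Seattle, 3 in NYC, 5 in Chicago
df <- data.frame(
  city = rep(c("Seattle", "NYC", "Chicago"), times = c(5, 3, 5)),
  date_of_event = as.Date(
    c("01/13/2011", "01/17/2011", "03/15/2011", "05/21/2011", "05/23/2011",
      "01/20/2011", "01/22/2011", "03/23/2011",
      "01/18/2011", "02/24/2011", "02/26/2011", "04/30/2011", "06/18/2011"),
    "%m/%d/%Y"
  ),
  stringsAsFactors = FALSE
)

# First day of each month, starting January 2011 (93 months, an assumed range)
all_months <- seq(as.Date("2011-01-01"), by = "month", length.out = 93)

filled <- df %>%
  mutate(new_date = floor_date(date_of_event, "month")) %>%
  count(city, new_date, name = "Freq") %>%
  # add every missing city/month pair and set its count to zero
  complete(city, new_date = all_months, fill = list(Freq = 0L))

filled
```

Here new_date stays a Date column rather than a factor, which tends to be more convenient for later plotting or filtering by date.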