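How to find the common periods without NA across all data sets
I have 3 data frames measured every 10 seconds over 6 months. I want to compare them, but they contain many gaps of missing values at different times during those 6 months. I am now looking for a way to compare the 3 data frames so as to **find the common periods in which none of the 3 data frames has missing values.** In other words, I want to know exactly which dates and times have data in all of the data frames, so that I can extract those data and continue my analysis.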
For example, here is some input data:
#df1
date V1
2010-02-01 00:00:00 15278
2010-02-01 00:00:10 15257
2010-02-01 00:00:20 15273
2010-02-01 00:00:30 15386
2010-02-01 00:00:40 15333
2010-02-01 00:00:50 15360
2010-02-01 00:01:00 17357
2010-02-01 00:01:10 na
2010-02-01 00:01:20 na
2010-02-01 00:01:30 na
2010-02-01 00:01:40 na
2010-02-01 00:01:50 14214
2010-02-01 00:02:00 na
2010-02-01 00:02:10 14233
2010-02-01 00:02:20 14183
2010-02-01 00:02:30 14100
2010-02-01 00:02:40 14070
2010-02-01 00:02:50 na
...
and for #df2
date V2
2010-02-01 00:00:00 15
2010-02-01 00:00:10 12
2010-02-01 00:00:20 13
2010-02-01 00:00:30 16
2010-02-01 00:00:40 13
2010-02-01 00:00:50 15
2010-02-01 00:01:00 17
2010-02-01 00:01:10 na
2010-02-01 00:01:20 na
2010-02-01 00:01:30 na
2010-02-01 00:01:40 na
2010-02-01 00:01:50 16
2010-02-01 00:02:00 na
2010-02-01 00:02:10 14
2010-02-01 00:02:20 11
2010-02-01 00:02:30 10
2010-02-01 00:02:40 13
2010-02-01 00:02:50 17
...
and for #df3
date V3
2010-02-01 00:00:00 11278
2010-02-01 00:00:10 11257
2010-02-01 00:00:20 11273
2010-02-01 00:00:30 12386
2010-02-01 00:00:40 13333
2010-02-01 00:00:50 na
2010-02-01 00:01:00 11357
2010-02-01 00:01:10 na
2010-02-01 00:01:20 na
2010-02-01 00:01:30 na
2010-02-01 00:01:40 na
2010-02-01 00:01:50 12542
2010-02-01 00:02:00 na
2010-02-01 00:02:10 na
2010-02-01 00:02:20 13183
2010-02-01 00:02:30 14100
2010-02-01 00:02:40 18850
2010-02-01 00:02:50 14770
...
Solution
I think you can use the following approach. Here is the data in a readable format.
library(dplyr)  # mutate(), inner_join(), lag(), group_by(), summarise()
library(tidyr)  # unnest()
# The 18 timestamps are the same in all three data frames, so build them once (kept as character strings for now)
dates <- format(seq(as.POSIXct("2010-02-01 00:00:00", tz = "UTC"), by = "10 sec", length.out = 18), "%Y-%m-%d %H:%M:%S")
df1 <- tibble::tibble(date = dates, V1 = c(15278, 15257, 15273, 15386, 15333, 15360, 17357, NA, NA, NA, NA, 14214, NA, 14233, 14183, 14100, 14070, NA))
df2 <- tibble::tibble(date = dates, V2 = c(15, 12, 13, 16, 13, 15, 17, NA, NA, NA, NA, 16, NA, 14, 11, 10, 13, 17))
df3 <- tibble::tibble(date = dates, V3 = c(11278, 11257, 11273, 12386, 13333, NA, 11357, NA, NA, NA, NA, 12542, NA, NA, 13183, 14100, 18850, 14770))
First, make sure the dates are in a proper date-time format.
df1 <- df1 %>% mutate(date = lubridate::ymd_hms(date))
df2 <- df2 %>% mutate(date = lubridate::ymd_hms(date))
df3 <- df3 %>% mutate(date = lubridate::ymd_hms(date))
Save the original data frames for later use:
df1_orig <- df1
df2_orig <- df2
df3_orig <- df3
Then apply listwise deletion, dropping every row that contains a missing value:
df1 <- na.omit(df1)
df2 <- na.omit(df2)
df3 <- na.omit(df3)
Next, you need inner_join(), because it keeps only the observations that are common to both data sets.
df_all <- inner_join(df1, df2, by = "date")
df_all <- inner_join(df_all, df3, by = "date")
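As a minimal sketch of these two steps, the toy example below uses base R only: `na.omit()` drops incomplete rows, and `merge()` with its default `all = FALSE` behaves like an inner join. The data frames `a` and `b` and their values are made up for illustration and are not part of the question's data.

```r
# Hypothetical toy data: two series on the same three timestamps
a <- data.frame(date = c("00:00:00", "00:00:10", "00:00:20"),
                V1 = c(1, NA, 3))
b <- data.frame(date = c("00:00:00", "00:00:10", "00:00:20"),
                V2 = c(10, 20, NA))

# Drop rows with NA, then keep only timestamps present in both
common <- merge(na.omit(a), na.omit(b), by = "date")
common
```

Only the 00:00:00 row survives, since it is the only timestamp that has a value in both `a` and `b`.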
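Now df_all contains only the complete data common to all three data sets. You can then take the lag of the date (the previous observation) and check whether it is exactly 10 seconds earlier than the current observation. If it is, cont is 0; if the gap is anything other than 10 seconds, cont is 1. The first row has no previous observation, so its cont is set to 1. Taking the cumulative sum of cont then labels each run of consecutive observations in the data with its own group number.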
df_all <- df_all %>%
  mutate(lag_date = lag(date),
         cont = as.numeric(lag_date != (date - lubridate::hms("00:00:10"))),
         cont = ifelse(is.na(cont), 1, cont),  # the first row starts a new group
         group = cumsum(cont))
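The run-labelling idea can also be sketched in base R, which may make the logic easier to follow in isolation. The timestamps below are hypothetical, and `diff()` stands in for the `lag()` comparison; this is an illustration of the technique, not a replacement for the code above.

```r
# Hypothetical timestamps: three runs separated by gaps longer than 10 s
times <- as.POSIXct(c("2010-02-01 00:00:00", "2010-02-01 00:00:10",
                      "2010-02-01 00:00:20", "2010-02-01 00:01:00",
                      "2010-02-01 00:01:50", "2010-02-01 00:02:00"),
                    tz = "UTC")

# Seconds between each observation and the previous one
gap <- as.numeric(diff(times), units = "secs")

# A new run starts at the first row and wherever the gap is not 10 s
new_run <- c(TRUE, gap != 10)

# The cumulative sum turns the run starts into run labels
run_id <- cumsum(new_run)

# First and last timestamp of each run
runs <- data.frame(run   = unique(run_id),
                   start = times[new_run],
                   end   = times[c(new_run[-1], TRUE)])
runs
```

Here `run_id` labels the first three timestamps as run 1, the fourth as run 2, and the last two as run 3.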
Finally, you can group by the group variable and find the minimum and maximum of date within each group.
res <- df_all %>%
  group_by(group) %>%
  summarise(start = min(date), end = max(date))
res
# # A tibble: 4 x 3
# group start end
# * <dbl> <dttm> <dttm>
# 1 1 2010-02-01 00:00:00 2010-02-01 00:00:40
# 2 2 2010-02-01 00:01:00 2010-02-01 00:01:00
# 3 3 2010-02-01 00:01:50 2010-02-01 00:01:50
# 4 4 2010-02-01 00:02:20 2010-02-01 00:02:40
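I understand that you have a lot of data, so hopefully this will be fast enough. In my experience, dplyr functions tend to scale better than their base R counterparts, so hopefully that holds here as well.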
Edit: filtering the original data
To filter the original data so that it contains only these times, you can do the following:
keep_times <- res %>%
  rowwise() %>%
  # one 10-second sequence per group, from its start to its end
  mutate(date = list(seq(from = start, to = end, by = "10 sec"))) %>%
  tidyr::unnest(date) %>%
  ungroup() %>%
  select(date)
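The code above builds, for each row, a sequence of 10-second steps from the start time to the end time. It then unnests the list column and keeps only the resulting sequence of times. You can then left_join the original data onto it: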
d1 <- left_join(keep_times, df1_orig, by = "date")
d2 <- left_join(keep_times, df2_orig, by = "date")
d3 <- left_join(keep_times, df3_orig, by = "date")
The results are as follows:
d1
# # A tibble: 10 x 2
# date V1
# <dttm> <dbl>
# 1 2010-02-01 00:00:00 15278
# 2 2010-02-01 00:00:10 15257
# 3 2010-02-01 00:00:20 15273
# 4 2010-02-01 00:00:30 15386
# 5 2010-02-01 00:00:40 15333
# 6 2010-02-01 00:01:00 17357
# 7 2010-02-01 00:01:50 14214
# 8 2010-02-01 00:02:20 14183
# 9 2010-02-01 00:02:30 14100
# 10 2010-02-01 00:02:40 14070
d2
# # A tibble: 10 x 2
# date V2
# <dttm> <dbl>
# 1 2010-02-01 00:00:00 15
# 2 2010-02-01 00:00:10 12
# 3 2010-02-01 00:00:20 13
# 4 2010-02-01 00:00:30 16
# 5 2010-02-01 00:00:40 13
# 6 2010-02-01 00:01:00 17
# 7 2010-02-01 00:01:50 16
# 8 2010-02-01 00:02:20 11
# 9 2010-02-01 00:02:30 10
# 10 2010-02-01 00:02:40 13
d3
# # A tibble: 10 x 2
# date V3
# <dttm> <dbl>
# 1 2010-02-01 00:00:00 11278
# 2 2010-02-01 00:00:10 11257
# 3 2010-02-01 00:00:20 11273
# 4 2010-02-01 00:00:30 12386
# 5 2010-02-01 00:00:40 13333
# 6 2010-02-01 00:01:00 11357
# 7 2010-02-01 00:01:50 12542
# 8 2010-02-01 00:02:20 13183
# 9 2010-02-01 00:02:30 14100
# 10 2010-02-01 00:02:40 18850
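As a possible shortcut (an alternative sketch, not the method used above), the original rows could also be filtered directly by membership in the set of common timestamps, since the joined data already holds exactly those times. The example below uses base R and hypothetical stand-in objects: `orig` plays the role of one original data frame and `common_times` the role of the `date` column of the joined data.

```r
# Hypothetical original series: six readings, 10 seconds apart, with gaps
orig <- data.frame(
  date = as.POSIXct("2010-02-01 00:00:00", tz = "UTC") + 10 * 0:5,
  V1   = c(15278, 15257, NA, 15386, NA, 15360)
)

# Stand-in for the timestamps that are complete in every data set
common_times <- orig$date[c(1, 2, 4)]

# Keep only the rows whose timestamp is one of the common ones
kept <- orig[orig$date %in% common_times, ]
kept
```

This skips rebuilding the 10-second sequences, at the cost of not going through the start and end summary of each period.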