How can I find the most common sequences in my data using R?
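I'm trying to work out how to use the rollapply function (from the zoo package) to find the most common sequences of strings in a data set, but I also need to account for certain variables (e.g. date, row, etc.).
Before I go any further, it's worth noting that this query builds on a question I posted here previously: How can I find most common sequences (of strings) in my data using Tableau?
The solution provided there worked very well, but I now want to apply it to a different data set that poses some new challenges! Below is a sample of the data I'm using in this new data set: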
structure(list(Title = c("Dragons' Den","One Hot Summer","Keeping Faith","Cuckoo","Match of the Day","Sportscene","The Irish League Show","EastEnders","Dragons' Den","Fake or Fortune?","Asian Provocateur","In The Flesh","Two Pints of Lager and a Packet of Crisps","Travels in Trumpland with Ed Balls","Hidden","Train Surfing Wars: A Matter of Life and Death","Bollywood: The World's Biggest Film Industry","The Next Step","Doctor Who Series 11 Trailer","Doctor Who","Picnic at Hanging Rock","Sylvia","Cardinal: Blackfly Season","Age Before Beauty","Stewart Lee's Comedy Vehicle","Holby City","Who Do You Think You Are?","Louis Theroux: Dark States","Louis Theroux","Louis Theroux's Weird Weekends","Cardinal: Blackfly Season"
),Programme_Genre = c("Entertainment","Documentary","Drama","New SeriesComedy","Sport","Entertainment","Comedy","Crime Drama","CBBC","Sci-Fi","Film","On Now","History","Archive","Crime Drama"),Programme_Category = c("Featured","Featured","This Weekend's Football","Most Popular","Box Sets","Doctor Who S1-S10","Drama"),date = c("13/08/2018","13/08/2018","14/08/2018","15/08/2018","15/08/2018"),column = c("1","2","3","4","1","4"),row = c("1","5","5")),row.names = c(NA,-56L),class = "data.frame")
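Apologies, I'm not quite sure what the best practice for sharing data is. Hopefully the above works. It should look something like this: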
  Title            Programme_Genre   Programme_Category date       column row
1 Dragons Den      Entertainment     Featured           13/08/2018 1      1
2 One Hot Summer   Documentary       Featured           13/08/2018 2      1
3 Keeping Faith    Drama             Featured           13/08/2018 3      1
4 Cuckoo           New Series Comedy Featured           13/08/2018 4      1
5 Match of the Day Sport             This Weekends...   13/08/2018 1      2
6 Sportscene       Sport             This Weekends...   13/08/2018 2      2
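What I want to do is use the rollapply function, much as was suggested for my previous question (see the link above), but only to find sequences that occur on the same date and within a specific range of columns. For example, I want to know what the most common sequence of genres ("Programme_Genre") is, but I only want rollapply to look within columns 1-4 of each row on each date. I'm sure I haven't explained this very well (I'm not from a data science background, in case you hadn't guessed), so I'm happy to elaborate where necessary. Thanks in advance!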
Solution
Using tidyverse, zoo and lubridate, try:
library(tidyverse)
library(zoo)
library(lubridate)

# `df` is the data frame from the question.
df %>%
  mutate(date = lubridate::dmy(date)) %>%   # Optional. Parses date as Date class, which makes sorting easier.
  filter(as.integer(column) <= 4) %>%       # Step 1. Keep columns 1-4. `column` is stored as character, so compare it numerically.
  group_split(row, date) %>%                # Step 2. Split the DF into smaller DFs, one per row and date group.
  # Step 3 (below). Apply the solution to the previous question to each group, get a DF,
  # and attach the group's date and row to each three-item window.
  map_df(.x = ., .f = ~(rollapply(data = .x$Programme_Genre, 3, c) %>%
                          as_tibble() %>%
                          mutate(date = unique(.x$date), row = unique(.x$row)))) %>%
  group_by_all() %>%                        # Step 4. Count identical (V1, V2, V3, date, row) combinations.
  tally() %>%
  arrange(date, row, n)
# A tibble: 26 x 6
# Groups: V1,V2,V3,date [26]
V1 V2 V3 date row n
<chr> <chr> <chr> <date> <chr> <int>
1 Documentary Drama New SeriesComedy 2018-08-13 1 1
2 Entertainment Documentary Drama 2018-08-13 1 1
3 Sport Sport Sport 2018-08-13 2 2
4 Drama Entertainment Documentary 2018-08-13 3 1
5 Sport Drama Entertainment 2018-08-13 3 1
6 Comedy Drama Comedy 2018-08-13 4 1
7 Drama Comedy Documentary 2018-08-13 4 1
8 Crime Drama Documentary Documentary 2018-08-14 1 1
9 Documentary Documentary Documentary 2018-08-14 1 1
10 Comedy Drama Comedy 2018-08-14 2 1
# ... with 16 more rows
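To make the rolling-window step more concrete, here is a minimal base R sketch of the same idea on a single vector: slide a window of three consecutive values along it and count how often each window occurs. The helper name `count_windows` and the toy `genres` vector are made up for illustration; they are not part of the answer above or of zoo.

```r
# Hypothetical helper: count every run of n consecutive values in x.
count_windows <- function(x, n = 3) {
  # A vector shorter than n cannot hold a single window.
  if (length(x) < n) return(table(character(0)))
  starts <- seq_len(length(x) - n + 1)
  # Collapse each window into one string so identical windows can be tallied.
  keys <- vapply(
    starts,
    function(i) paste(x[i:(i + n - 1)], collapse = " > "),
    character(1)
  )
  sort(table(keys), decreasing = TRUE)
}

# Toy input (not the asker's data).
genres <- c("Sport", "Sport", "Sport", "Sport", "Drama", "Comedy")
res <- count_windows(genres, 3)
print(res)
```

Applying a helper like this separately to each date and row group is, in spirit, what the `group_split()` plus `map_df()` combination does in the pipeline.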
In this case too, I would suggest a strategy similar to the one proposed in the linked question.
First, load the libraries:
library(tidyverse)
library(runner)
The strategy for, say, n = 3:
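n <- 3
# `data` is the data frame from the question.
data %>%
  group_by(date) %>%
  # For each position, collect the last n genres; incomplete windows become NA.
  mutate(l_seq = runner(x = Programme_Genre, k = n,
                        function(x) ifelse(length(x) == n, list(x), list(rep(NA, n))))) %>%
  ungroup() %>%
  group_split(date) %>%
  # Per date: spread each window into Col1..Coln, count duplicates, keep the top one.
  map_df(., ~ map_df(.x$l_seq, ~ setNames(.x, paste0('Col', seq_len(n)))) %>%
           mutate(date = .x$date) %>%
           na.omit() %>%
           group_by_all() %>%
           summarise(m = n(), .groups = 'drop') %>%
           filter(m == max(m) & m > 1)
  )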
# A tibble: 2 x 5
Col1 Col2 Col3 date m
<chr> <chr> <chr> <chr> <int>
1 Sport Sport Sport 13/08/2018 3
2 Sci-Fi Sci-Fi Sci-Fi 14/08/2018 2
Needless to say, m is the column that gives the number of times the most frequent sequence occurs on that particular date.
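The final filter keeps, for each date, only the sequence with the highest count, and only if that count is greater than 1. A rough base R sketch of that rule, using invented toy data and a hypothetical helper `top_window` (neither comes from the answer or from the runner package), might look like this:

```r
# Toy data, made up for illustration only.
toy <- data.frame(
  date  = c(rep("13/08/2018", 5), rep("14/08/2018", 4)),
  genre = c("Sport", "Sport", "Sport", "Sport", "Drama",
            "Drama", "Comedy", "Film", "News"),
  stringsAsFactors = FALSE
)
n <- 3

# Hypothetical helper: most frequent n-item window in x, or NULL if none repeats.
top_window <- function(x, n) {
  if (length(x) < n) return(NULL)
  starts <- seq_len(length(x) - n + 1)
  keys <- vapply(
    starts,
    function(i) paste(x[i:(i + n - 1)], collapse = " > "),
    character(1)
  )
  counts <- table(keys)
  # Same rule as filter(m == max(m) & m > 1) in the pipeline.
  best <- counts[counts == max(counts) & counts > 1]
  if (length(best) == 0) NULL else best
}

# One result per date; dates without a repeated window yield NULL.
result <- lapply(split(toy$genre, toy$date), top_window, n = n)
print(result)
```

In this toy example the first date has a window that repeats, while the second date has none, so its entry is NULL, mirroring how dates without repeated sequences drop out of the tibble.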
With n = 4, the same code gives the following result:
# A tibble: 1 x 6
Col1 Col2 Col3 Col4 date m
<chr> <chr> <chr> <chr> <chr> <int>
1 Sport Sport Sport Sport 13/08/2018 2
The sample contains no sequence of length 5 that occurs more than once.