
How can I find the most common sequences in my data using R?


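I'm trying to work out how to use the rollapply function (from the zoo package) to find the most common sequences of strings in a dataset, but I also need to take certain variables (e.g. date, row, etc.) into account.

Before I go any further, it's worth noting that this query builds on a question I posted here previously: How can I find most common sequences (of strings) in my data using Tableau?

The solution provided there worked really well, but I now want to apply it to a different dataset that presents some new challenges! Below is a sample of the data I'm using in this new dataset: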

structure(list(Title = c("Dragons' Den","One Hot Summer","Keeping Faith","Cuckoo","Match of the Day","Sportscene","The Irish League Show","EastEnders","Dragons' Den","Fake or Fortune?","Asian Provocateur","In The Flesh","Two Pints of Lager and a Packet of Crisps","Travels in Trumpland with Ed Balls","Hidden","Train Surfing Wars: A Matter of Life and Death","Bollywood: The World's Biggest Film Industry","The Next Step","Doctor Who Series 11 Trailer","Doctor Who","Picnic at Hanging Rock","Sylvia","Cardinal: Blackfly Season","Age Before Beauty","Stewart Lee's Comedy Vehicle","Holby City","Who Do You Think You Are?","Louis Theroux: Dark States","Louis Theroux","Louis Theroux's Weird Weekends","Cardinal: Blackfly Season"
),Programme_Genre = c("Entertainment","Documentary","Drama","New SeriesComedy","Sport","Entertainment","Comedy","Crime Drama","CBBC","Sci-Fi","Film","On Now","History","Archive","Crime Drama"),Programme_Category = c("Featured","Featured","This Weekend's Football","Most Popular","Box Sets","Doctor Who S1-S10","Drama"),date = c("13/08/2018","13/08/2018","14/08/2018","15/08/2018","15/08/2018"),column = c("1","2","3","4","1","4"),row = c("1","5","5")),row.names = c(NA,-56L),class = "data.frame") 

Apologies, I'm not sure of the best practice for sharing data. Hopefully the above works. It should look something like this:

  Title            Programme_Genre   Programme_Category date       column row
1 Dragons Den      Entertainment     Featured           13/08/2018 1      1
2 One Hot Summer   Documentary       Featured           13/08/2018 2      1
3 Keeping Faith    Drama             Featured           13/08/2018 3      1
4 Cuckoo           New Series Comedy Featured           13/08/2018 4      1
5 Match of the Day Sport             This Weekends...   13/08/2018 1      2
6 Sportscene       Sport             This Weekends...   13/08/2018 2      2

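What I'd like to do is use the rollapply function, much as was suggested in my previous question (see the link above), but only to find sequences that occur on the same date and within a specific range of columns. For example, I'd like to know what the most common sequences of genres ("Programme_Genre") are, but I only want rollapply to look at columns 1-4 of each row on each date. I'm sure I haven't explained this very well (I'm not from a data science background, in case you hadn't guessed), so I'm happy to elaborate if necessary. Thanks in advance!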

Solution

Using tidyverse, zoo and lubridate, try:

library(tidyverse)
library(zoo)
library(lubridate)

df %>% 
  mutate(date = lubridate::dmy(date)) %>% # Optional. Parses date as Date class, which makes sorting easier.
  filter(as.integer(column) <= 4) %>% # Step 1. Exclude observations with `column` values above 4 (`column` is stored as character, so convert it before comparing).
  group_split(row, date) %>% # Step 2. Split the DF into smaller DFs, one per row and date group.
  # Step 3 (below). Apply the solution from the previous question to each group, collect the results in one DF, and attach the date and row to each observation.
  map_df(.x = ., .f = ~(rollapply(data = .x$Programme_Genre, 3, c) %>% 
                  as_tibble() %>% 
                  mutate(date = unique(.x$date), row = unique(.x$row)))) %>% 
  group_by_all() %>% 
  tally() %>% 
  arrange(date, row, n)

# A tibble: 26 x 6
# Groups:   V1,V2,V3,date [26]
   V1            V2            V3               date       row       n
   <chr>         <chr>         <chr>            <date>     <chr> <int>
 1 Documentary   Drama         New SeriesComedy 2018-08-13 1         1
 2 Entertainment Documentary   Drama            2018-08-13 1         1
 3 Sport         Sport         Sport            2018-08-13 2         2
 4 Drama         Entertainment Documentary      2018-08-13 3         1
 5 Sport         Drama         Entertainment    2018-08-13 3         1
 6 Comedy        Drama         Comedy           2018-08-13 4         1
 7 Drama         Comedy        Documentary      2018-08-13 4         1
 8 Crime Drama   Documentary   Documentary      2018-08-14 1         1
 9 Documentary   Documentary   Documentary      2018-08-14 1         1
10 Comedy        Drama         Comedy           2018-08-14 2         1
# ... with 16 more rows
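
To make the windowing logic easier to follow, here is a minimal base R sketch of the same idea: keep columns 1-4, slide a window of three genres along each date/row group, and count how often each window occurs overall. The `toy` data frame and the `windows()` helper are hypothetical and made up for illustration; they are not part of the asker's data or of zoo.

```r
# Hypothetical toy data: one row of the page on each of two dates, columns 1-5.
toy <- data.frame(
  Programme_Genre = c("Sport", "Sport", "Sport", "Drama", "Comedy",
                      "Sport", "Sport", "Sport", "Sport", "Drama"),
  date   = rep(c("13/08/2018", "14/08/2018"), each = 5),
  row    = 1L,
  column = rep(1:5, times = 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: every run of k consecutive values, pasted into one label.
windows <- function(x, k) {
  if (length(x) < k) return(character(0))
  starts <- seq_len(length(x) - k + 1)
  vapply(starts,
         function(i) paste(x[i:(i + k - 1)], collapse = " > "),
         character(1))
}

# Keep columns 1-4 only and make sure each group is in column order.
kept <- toy[toy$column <= 4, ]
kept <- kept[order(kept$date, kept$row, kept$column), ]

# One genre vector per date/row group, so no window spans two groups.
groups <- split(kept$Programme_Genre, list(kept$date, kept$row), drop = TRUE)

# Build the windows within each group, then count them across all groups.
all_windows <- unlist(lapply(groups, windows, k = 3), use.names = FALSE)
counts <- sort(table(all_windows), decreasing = TRUE)
print(counts)
```

Splitting before windowing is the important design choice here: it stops a sequence from running off the end of one row or date and into the next.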

In this case, too, I would suggest a strategy similar to the one proposed in the linked question.

First, load the libraries:

library(tidyverse)
library(runner)

Strategy for n = 3:

n <- 3

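data %>% 
  group_by(date) %>%
  # Within each date, collect every window of n consecutive genres; incomplete windows become NA.
  mutate(l_seq = runner(x = Programme_Genre, k = n,
                        function(x) ifelse(length(x) == n, list(x), list(rep(NA, n))))) %>%
  ungroup() %>%
  group_split(date) %>%
  # Per date: spread each window into Col1..Coln, count identical windows, and keep the most frequent one if it occurs more than once.
  map_df(., ~ map_df(.x$l_seq, ~ setNames(.x, paste0('Col', seq_len(n)))) %>%
           mutate(date = .x$date) %>% 
           na.omit() %>%
           group_by_all() %>%
           summarise(m = n(), .groups = 'drop') %>%
           filter(m == max(m) & m > 1)
  )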

# A tibble: 2 x 5
  Col1   Col2   Col3   date           m
  <chr>  <chr>  <chr>  <chr>      <int>
1 Sport  Sport  Sport  13/08/2018     3
2 Sci-Fi Sci-Fi Sci-Fi 14/08/2018     2

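Needless to say, m is the column that gives the number of occurrences of the most frequent sequence on that particular date.

With n = 4, the same code gives the following result: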

# A tibble: 1 x 6
  Col1  Col2  Col3  Col4  date           m
  <chr> <chr> <chr> <chr> <chr>      <int>
1 Sport Sport Sport Sport 13/08/2018     2

The sample contains no sequence of length 5 that occurs more than once.
