Counting occurrences of specific words across two data frames in R using group_by


I have two data frames in R. The first (named Words) consists of a single column of words:

Words
Hello
Building
School
Hospital
Doctors

The second is a large dataset that looks like this:

id	description
382	Building a school
787	Hiring doctors for the new hospital and teachers for the school

I then want to group by id and get the following result:

id	description	Match
382	Building a school	2
787	Hiring doctors for the new hospital and teachers for the school	3

This is what I have tried:

library(stringr)

df <- df %>% group_by(df$id)

getCount <- function(data,keyword)
{
  wcount <- str_count(df$description,keyword)
  return(data.frame(data,wcount))
}

gCount(df$description,Words)

(I also tried converting the Words dataset to a list.)

And also:

df <- df %>% group_by(df$id)
table(df$description)

df$match <- df[df$description %in% Words$Words,]
table(df$match)

And finally:


Words.list <- setNames(split(Words,seq(nrow(Words))),rownames(Words))
description <- subset(df,select = c("description","id"))
description <- description %>% group_by(description$id)
description.list <- setNames(split(description,seq(nrow(description))),rownames(description))

str_to_search = Words.list
str_to_count = description.list

lengths(regmatches(str_to_search,gregexpr(str_to_count,str_to_search,fixed = TRUE)))

However, all I get are strange error messages that I don't understand.

Solution

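library(dplyr)    # provides mutate()
library(stringr)
library(purrr)

# Lower-case the keywords so that matching is case-insensitive
words <- c("Hello", "Building", "School", "Hospital", "Doctors") %>%
  str_to_lower()
descriptions <- c("Building a school", "Hiring doctors for the new hospital and teachers for the school")

# For each description, count the occurrences of every keyword and sum them
df_descriptions <- data.frame(description = descriptions) %>%
  mutate(Match = map_int(str_to_lower(description), ~ str_count(.x, words) %>% sum()))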
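To see what the inner expression does for a single description, here is a minimal sketch using the same keywords. The names text, per_word, total and substring_hit are only illustrative. str_count() is vectorised over its pattern argument, so it returns one count per keyword, and sum() collapses those into the Match value. Note that it matches substrings, so a keyword is also counted inside a longer word.

```r
library(stringr)

words <- c("hello", "building", "school", "hospital", "doctors")
text <- "building a school"

# One count per keyword, in the same order as `words`
per_word <- str_count(text, words)
names(per_word) <- words

# Total number of keyword hits in this description
total <- sum(per_word)

# Substring matching: "schools" still counts as a hit for "school"
substring_hit <- str_count("two schools", "school")
```

If substring hits like the last one are not wanted, whole-word matching is the safer choice.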

Edit: to count whole words only, split each description on spaces and check each token against words:

df_descriptions <- data.frame(description = descriptions) %>%
  mutate(
    Match = str_to_lower(description) %>%
      str_split(" ") %>%
      map_int(~sum(.x %in% words))
  )
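For the data frames as described in the question (Words with a Words column, df with id and description), the same idea could be sketched as follows. This is only one possible variant, not the answer's own code: it builds a single case-insensitive regular expression with word boundaries (\\b) and assumes the keywords contain no regex metacharacters. The names keywords, pattern, result and per_id are illustrative. Unlike splitting on spaces, word boundaries still match a keyword that is followed by punctuation, such as "school.".

```r
library(dplyr)
library(stringr)

# Data as described in the question
Words <- data.frame(Words = c("Hello", "Building", "School", "Hospital", "Doctors"))
df <- data.frame(
  id = c(382, 787),
  description = c(
    "Building a school",
    "Hiring doctors for the new hospital and teachers for the school"
  )
)

# One alternation pattern: \b(hello|building|...)\b
# Assumption: keywords are plain words without regex metacharacters
keywords <- str_to_lower(Words$Words)
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")

# str_count() is vectorised over description, so no map_int() is needed
result <- df %>%
  mutate(Match = str_count(description, regex(pattern, ignore_case = TRUE)))

# If an id can appear on several rows, sum the matches per id
per_id <- result %>%
  group_by(id) %>%
  summarise(Match = sum(Match))
```

With one row per id, as in the sample data, per_id carries the same counts as result; the group_by() step only matters when an id spans several rows.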