How to apply a regular expression in the quanteda package in R to remove consecutively repeated tokens (words)


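I am currently working on a text mining project, and after running my ngrams model I realized that I have sequences of repeated words. I would like to remove the repeated words while keeping their first occurrence. The code below illustrates what I intend to do. Thanks!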


textfun <- "This this this  this analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence"

textfun <- corpus(textfun)

textfuntoks <- tokens(textfun)

textfunRef <- tokens_replace(textfuntoks,pattern = **?**,replacement = **?**,valuetype ="regex")

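The desired result is "This analysis should remove all of the duplicated or repeated words and return only their first occurrence". I am only interested in consecutive repetitions.

My main problem is coming up with values for the "pattern" and "replacement" arguments of the tokens_replace function. I have tried different patterns, some of them adapted from sources on this site, but none of them seems to work. An image of the problem is included: [5-gram frequency distribution showing instances of words such as "swag", "pleas", "gas", "books", "chicago", "happi"]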

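Solution

You can split the data into individual words, use rle to find consecutive occurrences, and paste the first value of each run back together.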

textfun <- "This this this this analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence"

paste0(rle(tolower(strsplit(textfun,'\\s+')[[1]]))$values,collapse = ' ')

#[1] "this analysis should remove all of the duplicated or repeated words and return only their first occurrence"
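Note that the one-liner above lowercases every word, so the result starts with "this" rather than "This". As a minimal sketch in base R (the helper names `dedupe_runs` and `dedupe_regex` are hypothetical, not part of any package), the same idea can keep the original spelling of each first occurrence. A plain regex variant is shown as well; it works on the raw string rather than through tokens_replace().

```r
# Keep the first word of every run of consecutive duplicates,
# comparing case-insensitively but preserving the original spelling.
dedupe_runs <- function(x) {
  words <- strsplit(trimws(x), "\\s+")[[1]]
  runs <- rle(tolower(words))
  # position of the first word of each run
  first_idx <- cumsum(c(1, head(runs$lengths, -1)))
  paste(words[first_idx], collapse = " ")
}

# Regex alternative on the raw string: a word followed by one or more
# whitespace-separated copies of itself is replaced by the first copy.
dedupe_regex <- function(x) {
  gsub("\\b(\\w+)(\\s+\\1\\b)+", "\\1", x, ignore.case = TRUE, perl = TRUE)
}

txt <- "This this this  this analysis analysis should should remove remove all all all"
dedupe_runs(txt)
#[1] "This analysis should remove all"
dedupe_regex(txt)
#[1] "This analysis should remove all"
```

Both helpers operate on a single character string; to use the result in quanteda, it could be passed to tokens() afterwards.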

Interesting challenge. To do this within quanteda, you can create a dictionary that maps each repeated sequence to its single occurrence.

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

corp <- corpus("This this this this will analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence")
toks <- tokens(corp)

ngrams <- tokens_tolower(toks) %>%
  tokens_ngrams(n = 5:2, concatenator = " ") %>%
  as.character()
# choose only the ngrams that are all the same word
ngrams <- ngrams[lengths(sapply(strsplit(ngrams, split = " "), unique, simplify = TRUE)) == 1]
# remove duplicates
ngrams <- unique(ngrams)
head(ngrams, n = 3)
## [1] "all all all all all" "return return return return return"
## [3] "this this this this"

So this provides a vector of all (lowercased) repeated values. (To avoid lowercasing, remove the tokens_tolower() line.)

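Now we create a dictionary in which each sequence is a "value" and each unique token is the "key". The list from which dict is built will contain multiple identical keys, but the dictionary() constructor combines them automatically. Once the dictionary is created, the sequences can be converted to single tokens using tokens_lookup().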
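The dictionary step can be sketched as follows. This is a self-contained illustration under stated assumptions, not necessarily the answer's exact code: the shortened example text, the helper variables `is_rep` and `keys`, and the argument choices passed to tokens_lookup() (exclusive = FALSE, capkeys = FALSE, nested_scope = "dictionary") are all assumptions made here. It requires the quanteda package and only handles runs of up to five repeats, matching the n = 5:2 range.

```r
library(quanteda)

# Assumption: a shortened example text with runs of at most five repeats.
toks <- tokens("This this this analysis analysis should remove all all all all words words")

# Candidate n-grams, lowercased and joined by a space.
ngrams <- as.character(
  tokens_ngrams(tokens_tolower(toks), n = 5:2, concatenator = " ")
)

# Keep only the n-grams that consist of one repeated word.
is_rep <- vapply(strsplit(ngrams, " ", fixed = TRUE),
                 function(w) length(unique(w)) == 1, logical(1))
ngrams <- unique(ngrams[is_rep])

# Key = the repeated word, value = the repeated sequence.
keys <- vapply(strsplit(ngrams, " ", fixed = TRUE), `[`, character(1), 1)
dict <- dictionary(structure(as.list(ngrams), names = keys))

# Replace each matched sequence by its key and keep unmatched tokens.
toks2 <- tokens_lookup(toks, dict, exclusive = FALSE, capkeys = FALSE,
                       nested_scope = "dictionary")
print(toks2, max_ntoken = -1)
```

Setting exclusive = FALSE keeps the tokens that match no dictionary value, and nested_scope = "dictionary" is intended to stop a shorter sequence from matching inside a longer one that belongs to a different key entry.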

Created on 2021-04-08 by the reprex package (v1.0.0)