R tm package: dictionary matching yields a higher frequency than the actual number of words in the text

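I have been using the code below to load text as a corpus and clean it with the tm package. As a next step, I load a dictionary and clean it as well. Then I match the words in the text against the dictionary to compute a score. However, the match count comes out higher than the actual number of words in the text (for example, the competence score is 1500, but the text actually contains only 1000 words).

I think this is related to stemming the text and the dictionary, because the match count is lower when no stemming is applied.

Do you know why this happens?

Thank you very much.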

R code

Step 1: Store the data as a corpus

library(tm)
library(here)
file.path <- file.path(here("Generated Files","Data Preparation"))
corpus <- Corpus(DirSource(file.path))

Step 2: Clean the data

#Removing special characters
toSpace <- content_transformer(function (x,pattern ) gsub(pattern," ",x))
corpus <- tm_map(corpus,toSpace,"/")
corpus <- tm_map(corpus,toSpace,"@")
corpus <- tm_map(corpus,toSpace,"\\|")

#Convert the text to lower case
corpus <- tm_map(corpus,content_transformer(tolower))
#Remove numbers
corpus <- tm_map(corpus,removeNumbers)
#Remove english common stopwords
corpus <- tm_map(corpus,removeWords,stopwords("english"))
#Remove your own stop words
#specify your stopwords as a character vector
corpus <- tm_map(corpus,removeWords,c("view","pdf"))
#Remove punctuations
corpus <- tm_map(corpus,removePunctuation)
#Eliminate extra white spaces
corpus <- tm_map(corpus,stripWhitespace)
#Text stemming
corpus <- tm_map(corpus,stemDocument)
#Unique words
corpus <- tm_map(corpus,unique)

Step 3: DTM

dtm <- DocumentTermMatrix(corpus)

Step 4: Load the dictionary

library(readxl)
dic.competence <- read_excel(here("Raw Data","6. Dictionaries","Brand.xlsx"))
dic.competence <- tolower(dic.competence$COMPETENCE)
dic.competence <- stemDocument(dic.competence)
dic.competence <- unique(dic.competence)

Step 5: Compute the frequency

corpus.terms = colnames(dtm)
competence = match(corpus.terms,dic.competence,nomatch=0)

Step 6: Compute the score

competence.score = sum(competence) / rowSums(as.matrix(dtm))
competence.score.df = data.frame(scores = competence.score)

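Answer

What does competence return when you run that line? I am not sure how your dictionary is set up, so I can't say for certain what is happening there. I used my own random corpus text as the main text and a separate corpus as the dictionary, and your code ran fine. The row names of competence.score.df are the names of the different txt files in my corpus, and the scores all fall in the 0-1 range.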
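
To make the question about what competence returns concrete, here is a minimal sketch using toy term vectors (the values are hypothetical, not the asker's data). In base R, match() returns the position of each term in the dictionary rather than a 0/1 flag, so sum(competence) adds up dictionary positions. That may be one reason the score can exceed the word count; counting with %in% gives the number of matched terms instead.

```r
# Toy stand-ins (hypothetical values, not the asker's data)
corpus.terms   <- c("abl", "brand", "compet", "expert", "skill")
dic.competence <- c("skill", "expert", "compet", "abl")

# match() returns the POSITION of each corpus term in the dictionary,
# or the nomatch value (0 here) when the term is absent
competence <- match(corpus.terms, dic.competence, nomatch = 0)
print(competence)          # 4 0 3 2 1
print(sum(competence))     # 10: a sum of positions, not a count of matches

# %in% returns TRUE/FALSE per term, so its sum is the number of matches
is.match <- corpus.terms %in% dic.competence
print(sum(is.match))       # 4
```

With a dictionary of a few hundred entries, the positions returned by match() can run into the hundreds, so their sum grows quickly even when only a handful of terms actually match.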

# this is my 'dictionary' of terms:
tdm <- TermDocumentMatrix(Corpus(DirSource("./corpus/corpus3")),
                          control = list(removeNumbers = TRUE, stopwords = TRUE,
                                         stemming = TRUE, removePunctuation = TRUE))

# then I used your programming and it worked as I think you were expecting

# notice what I used here for the dictionary    
(competence = match(colnames(dtm),Terms(tdm)[1:10],# I only used the first 10 in my test of your code
                    nomatch = 0))

(competence.score = sum(competence)/rowSums(as.matrix(dtm)))
(competence.score.df = data.frame(scores = competence.score))
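
If the goal is a per-document share of dictionary words, one possible way to compute it is sketched below. This is an assumption about the intended score, not something confirmed in the question. The small hand-made matrix m stands in for as.matrix(dtm), and its counts and the dictionary entries are hypothetical.

```r
# Toy document-term matrix standing in for as.matrix(dtm) (hypothetical counts)
m <- matrix(c(2, 0, 1, 3,
              0, 5, 1, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1.txt", "doc2.txt"),
                            c("abl", "brand", "compet", "view")))

# Hypothetical stemmed dictionary
dic.competence <- c("abl", "compet", "skill")

# TRUE for the columns (terms) that appear in the dictionary
in.dic <- colnames(m) %in% dic.competence

# Dictionary-word occurrences per document divided by
# all word occurrences in that document
competence.score <- rowSums(m[, in.dic, drop = FALSE]) / rowSums(m)
competence.score.df <- data.frame(scores = competence.score)
print(competence.score.df)
```

Because the numerator only counts occurrences that are also in the denominator, each score computed this way stays between 0 and 1.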
