DocumentTermMatrix with 0% sparsity

How to resolve a DocumentTermMatrix with 0% sparsity

I am trying to build a document-term matrix from a book written in Italian. I have the book as a PDF file, and I wrote a few lines of code:

#install.packages("pdftools")
library(pdftools)
library(tm)
text <- pdf_text("IoRobot.pdf")
# collapse pdf pages into 1
text <- paste(unlist(text), collapse = "")
myCorpus <- VCorpus(VectorSource(text))
mydtm <- DocumentTermMatrix(myCorpus, control = list(
  removeNumbers = TRUE, removePunctuation = TRUE,
  stopwords = stopwords("it"), stemming = TRUE
))
inspect(mydtm)

The result I get after the last line is:

<<DocumentTermMatrix (documents: 1, terms: 10197)>>
Non-/sparse entries: 10197/0
Sparsity           : 0%
Maximal term length: 39
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs calvin cosa donovan esser piú poi powel prima quando robot
   1    201  191     254   193 288 211   287   166    184   62

I noticed that the sparsity is 0%. Is this normal?

Solution

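Yes, this looks correct.
A document-term matrix is a matrix with documents as rows and terms as columns. With the default term-frequency weighting, each cell holds the number of times the term occurs in the document, and 0 if the term does not occur in it.
Sparsity is an indicator of the "quantity of 0s" in the document-term matrix.
A term is sparse with respect to a document when it does not occur in that document.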
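
To make the definition concrete, here is the ratio written out. The symbols D (number of documents), T (number of terms) and N_nz (number of non-zero cells) are labels introduced here for illustration, not names from tm. The percentage that tm prints is this ratio rounded to a whole percent.

```latex
\[
\text{sparsity} = \frac{\text{number of zero cells}}{D \times T}
               = 1 - \frac{N_{\mathrm{nz}}}{D \times T}
\]
\[
\text{matrix in the question: } D = 1,\; T = 10197,\; N_{\mathrm{nz}} = 10197
\;\Rightarrow\; 1 - \frac{10197}{1 \times 10197} = 0 = 0\%
\]
```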

To understand these points, let's look at a reproducible example that recreates a situation similar to yours:

library(tm)
text <- c("here some text")
corpus <- VCorpus(VectorSource(text))
DTM <- DocumentTermMatrix(corpus)
DTM

<<DocumentTermMatrix (documents: 1, terms: 3)>>
Non-/sparse entries: 3/0
Sparsity           : 0%
Maximal term length: 4
Weighting          : term frequency (tf)

Looking at the output, we can see that you have one document (so the DTM for this corpus consists of a single row).
Take a look:

as.matrix(DTM)
    Terms
Docs here some text
   1    1    1    1

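Now the output is easier to understand:

You have one document with three terms:

  <<DocumentTermMatrix (documents: 1, terms: 3)>>

You have 3 non-sparse entries (cells != 0 in the DTM) and 0 sparse entries (cells == 0):

  Non-/sparse entries: 3/0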

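So your sparsity is 0%, because a corpus with a single document cannot contain any 0s. Every term in the vocabulary comes from that one document, so every term has a non-zero count: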

  Sparsity           : 0%
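
The same thing happens with the PDF in the question, because all pages are pasted into a single string before the corpus is built. The sketch below uses a small made-up character vector, `pages`, as a stand-in for what `pdf_text()` returns (one element per page); the page contents are invented for illustration. It compares the collapsed corpus with a corpus that keeps one document per page.

```r
library(tm)

# Hypothetical stand-in for the output of pdf_text(): one string per page.
pages <- c("robot calvin donovan",
           "robot powell",
           "calvin robot prima")

# All pages collapsed into one document, as in the question.
one_doc <- DocumentTermMatrix(
  VCorpus(VectorSource(paste(pages, collapse = " ")))
)

# One document per page.
per_page <- DocumentTermMatrix(VCorpus(VectorSource(pages)))

# Count the zero cells in each matrix.
zeros_one_doc  <- sum(as.matrix(one_doc) == 0)
zeros_per_page <- sum(as.matrix(per_page) == 0)

print(dim(one_doc))    # rows = documents, columns = terms
print(dim(per_page))
print(c(zeros_one_doc, zeros_per_page))
```

With a single document there is no cell that can be zero, while the per-page matrix has zeros wherever a term is missing from a page. Whether one document per page is a sensible unit depends on the analysis; for a one-document matrix, 0% sparsity is simply what to expect.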

Now look at another example, one that does have sparse terms:

text <- c("here some text","other text")

corpus <- VCorpus(VectorSource(text))
DTM <- DocumentTermMatrix(corpus)

DTM
<<DocumentTermMatrix (documents: 2, terms: 4)>>
Non-/sparse entries: 5/3
Sparsity           : 38%
Maximal term length: 5
Weighting          : term frequency (tf)

as.matrix(DTM)
    Terms
Docs here other some text
   1    1     0    1    1
   2    0     1    0    1

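Now you have 3 sparse entries (the second number in 5/3). The matrix has 2 x 4 = 8 cells, so the sparsity is 3/8 = 0.375, which is displayed as 38%.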
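
As a cross-check, the same figure can be recomputed by hand from the dense matrix. This is a minimal sketch that rebuilds the two-document example; the variable names `zero_cells`, `total_cells` and `sparsity` are introduced here for illustration.

```r
library(tm)

text <- c("here some text", "other text")
corpus <- VCorpus(VectorSource(text))
DTM <- DocumentTermMatrix(corpus)

# Convert to an ordinary matrix so the cells can be counted directly.
m <- as.matrix(DTM)

zero_cells  <- sum(m == 0)   # sparse entries
total_cells <- length(m)     # documents x terms
sparsity    <- zero_cells / total_cells

print(c(zero_cells, total_cells))
print(sparsity)
```

Converting to a dense matrix is fine for a toy example like this one; for a large corpus it can use a lot of memory, so it is only meant as a way to verify the printed summary.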
