Unexpected docnum column after fitNewDocuments in the stm package

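I have a corpus of 5,354 news articles, many of which are duplicates. Using the stm package, I fit an stm model on 906 unique articles and then used alignCorpus and fitNewDocuments to apply the model to the rest of the corpus. I then used make.dt to build a data table of theta values for the whole corpus. This process created a new column called "docnum". I expected it to hold a distinct number for each document. Instead, it contains the numbers 1-906, with each number appearing 5-6 times and corresponding to the same set of theta values (see screenshot). I don't think it should behave this way, but I don't understand why it does. I couldn't find much help online for alignCorpus and fitNewDocuments (both part of the stm package), so I would appreciate any ideas or suggestions about what might be going on. It is hard to provide a reproducible example for this situation, so my full code and a screenshot of the resulting Excel file are below.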

library(stm)

temp <- textProcessor(documents = NCA4_Data_3$text[1:906], metadata = NCA4_Data_3[1:906, ], lowercase = FALSE, removestopwords = TRUE, removenumbers = TRUE, removepunctuation = TRUE, ucp = TRUE, stem = FALSE, wordLengths = c(3, Inf), sparselevel = 1, language = "en", verbose = TRUE, onlycharacter = FALSE, striphtml = TRUE, customstopwords =
                             c("https", "ads", "info", "privacy", "com", "gov", "via", "email", "print", "embedded", "said", "will", "says", "can", "like", "also", "photo", "photograph", "video", "credit", "sen", "rep", "dr", "mr", "ms", "mrs", "professor", "prof"), v1 = FALSE)

out <- prepDocuments(temp$documents, temp$vocab, temp$meta, lower.thresh = 1, upper.thresh = 815, subsample = NULL, verbose = TRUE)

STM.17 <- stm(documents = out$documents, vocab = out$vocab, K = 17, data = out$meta, prevalence = ~media_type, max.em.its = 1000, init.type = "Spectral", verbose = TRUE)

#Now we process the remaining documents
temp <- textProcessor(documents = NCA4_Data_3$text[907:nrow(NCA4_Data_3)], metadata = NCA4_Data_3[907:nrow(NCA4_Data_3), ])

#note we don't run prepCorpus here because we don't want to drop any words- we want 
#every word that showed up in the old documents.
newdocs <- alignCorpus(new = temp,old.vocab = STM.17$vocab)

#we get some helpful feedback on what has been retained and lost in the print out.
#and now we can fit our new held-out documents
fitNewDocuments(model = STM.17, documents = newdocs$documents, newData = newdocs$meta, origData = out$meta, prevalencePrior = "Covariate")

# #Export excel with theta values
stm.17.datatable <- make.dt(STM.17, meta = NCA4_Data_3)
view(stm.17.datatable)
write.xlsx(stm.17.datatable,"~/Desktop/Oct.23.2020/stm.17.datatable.xlsx")

[Screenshot of the resulting Excel file; image not included]
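One possible reading of the symptom, offered as a sketch rather than a confirmed diagnosis: as far as I can tell, make.dt builds its table only from the theta matrix stored in the model object, which has one row per document used to fit STM.17 (906 here), and the value returned by fitNewDocuments is never assigned in the code above. A table built that way can hold at most 906 distinct docnum values, so the repeats presumably come from those rows being reused when the metadata for all 5,354 articles is attached. The base R sketch below uses small simulated matrices to show the alternative of stacking both sets of topic proportions so that every document gets its own docnum. All names in it are hypothetical: `theta_fit` stands in for `STM.17$theta`, `theta_new` for the `theta` element of an assigned fitNewDocuments result (for example `new_fit$theta`), and `sim_theta` exists only to generate fake data.

```r
# Simulated stand-ins (assumptions); in the real workflow these would be:
#   theta_fit <- STM.17$theta     # documents used to fit the model
#   theta_new <- new_fit$theta    # new_fit: hypothetical name for the
#                                 # assigned result of fitNewDocuments()
set.seed(1)
K <- 3       # number of topics (17 in the question)
n_fit <- 4   # documents used to fit the model (906 in the question)
n_new <- 6   # held-out documents

# Draw random rows and normalise them so each row sums to 1,
# mimicking document-topic proportions
sim_theta <- function(n, k) {
  m <- matrix(runif(n * k), nrow = n, ncol = k)
  m / rowSums(m)
}
theta_fit <- sim_theta(n_fit, K)
theta_new <- sim_theta(n_new, K)

# Stack both sets of rows and number every document exactly once
theta_all <- rbind(theta_fit, theta_new)
colnames(theta_all) <- sprintf("Topic%d", seq_len(K))
theta_table <- data.frame(
  docnum = seq_len(nrow(theta_all)),
  source = rep(c("fit", "new"), times = c(n_fit, n_new)),
  theta_all
)

print(theta_table)
```

If metadata is attached afterwards, it is probably safer to use out$meta and newdocs$meta rather than the raw NCA4_Data_3, because textProcessor, prepDocuments and alignCorpus can each drop documents, and the metadata they return is already aligned with the documents that remain.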
