Doc2Vec build_vocab method fails

I am following this guide to build a gensim Doc2Vec model.

I created an MRE to highlight the problem:

import pandas as pd, numpy as np, warnings, nltk, string, re, gensim
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

def get_words(para):
    # Intended to strip date-like tokens such as 12/31/2020.
    pattern = r'([\d]|[\d][\d])\/([\d]|[\d][\d]\/([\d]{4}))'
    stop_words = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    no_dates = [re.sub(pattern, '', i) for i in para.lower().split()]
    # Three-argument maketrans: delete every punctuation character.
    no_punctuation = [nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in no_dates]
    stemmed_tokens = [stemmer.stem(word) for word in no_punctuation if word.strip() and len(word) > 1 and word not in stop_words]
    return stemmed_tokens

data_dict = {
    'ID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
    'Review': {
        0: "Even though the restauraunt was gross, the food was still good and I'd recommend it",
        1: 'My waiter was awful, my food was awful, I hate it all',
        2: 'I did not enjoy the food very much but I thought the waitstaff was fantastic',
        3: 'Even though the cleanliness level was fantastic, my food was awful',
        4: 'Everything was mediocre, but I guess mediocre is better than bad Nowadays',
        5: "Honestly there wasn't a single thing that was mediocre about this place",
        6: 'I Could not have enjoyed it more! Perfect',
        7: 'This place is perfectly awful. I think it should shut down to be honest',
        8: "I can't understand how anyone would say something negative",
        9: "It killed me. I'm writing this review as a ghost. That's how bad it was.",
    },
    'Bogus Field 1': {0: 'foo71', 1: 'foo92', 2: 'foo25', 3: 'foo88', 4: 'foo54', 5: 'foo10', 6: 'foo48', 7: 'foo76', 8: 'foo4', 9: 'foo11'},
    'Bogus Field 2': {0: 'foo12', 1: 'foo66', 2: 'foo94', 3: 'foo90', 4: 'foo97', 5: 'foo87', 6: 'foo10', 7: 'foo4', 8: 'foo16', 9: 'foo86'},
    'Sentiment': {0: 1, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 0, 8: 1, 9: 0},
}

df = pd.DataFrame(data_dict, columns=data_dict.keys())
train, test = train_test_split(df, test_size=0.3, random_state=8)
train_tagged = train.apply(lambda x: TaggedDocument(words=get_words(x['Review']), tags=x['Sentiment']), axis=1)

model_dbow = Doc2Vec(dm=0, vector_size=50, negative=5, hs=0, min_count=1, sample=0, workers=8)
model_dbow.build_vocab([x for x in train_tagged.values])

This produces:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-590096b99bf9> in <module>
----> 1 model_dbow.build_vocab([x for x in train_tagged.values])

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self,documents,corpus_file,update,progress_per,keep_raw_vocab,trim_rule,**kwargs)
    926         total_words,corpus_count = self.vocabulary.scan_vocab(
    927             documents=documents,corpus_file=corpus_file,docvecs=self.docvecs,
--> 928             progress_per=progress_per,trim_rule=trim_rule
    929         )
    930         self.corpus_count = corpus_count

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self,docvecs,trim_rule)
   1123             documents = TaggedLineDocument(corpus_file)
   1124 
-> 1125         total_words,corpus_count = self._scan_vocab(documents,trim_rule)
   1126 
   1127         logger.info(

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in _scan_vocab(self,trim_rule)
   1069             document_length = len(document.words)
   1070 
-> 1071             for tag in document.tags:
   1072                 _note_doctag(tag,document_length,docvecs)
   1073 

TypeError: 'int' object is not iterable

I don't understand where the int type comes from, because print(set([type(x) for x in train_tagged])) yields: {<class 'gensim.models.doc2vec.TaggedDocument'>}

Note that other troubleshooting attempts, such as:

train_tagged = train.apply(lambda x: TaggedDocument(words=[get_words(x['Review'])], tags=[x['Sentiment']]), axis=1)

yield:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-7bd5804d8d95> in <module>
----> 1 model_dbow.build_vocab(train_tagged)

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self,trim_rule)
   1073 
   1074             for word in document.words:
-> 1075                 vocab[word] += 1
   1076             total_words += len(document.words)
   1077 

TypeError: unhashable type: 'list'

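Solution

Your first attempt is definitely putting a single value where the TaggedDocument instances need a list of values, even if it is just a list of one value.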
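To see why the type check in the question looked fine while build_vocab() still failed, here is a minimal sketch that mimics the `for tag in document.tags` loop from the first traceback. It assumes a plain namedtuple as a stand-in for gensim's TaggedDocument (which is itself a namedtuple with `words` and `tags` fields) so that it runs without gensim; `collect_tags` is a hypothetical helper, not a gensim function.

```python
from collections import namedtuple

# Stand-in (assumption) for gensim.models.doc2vec.TaggedDocument.
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def collect_tags(documents):
    """Hypothetical helper: loop over every tag of every document."""
    seen = []
    for document in documents:
        for tag in document.tags:  # same loop shape as in the traceback
            seen.append(tag)
    return seen


# Each object is a TaggedDocument either way; only the tags field differs.
bad = [TaggedDocument(words=["food", "good"], tags=1)]      # bare int
good = [TaggedDocument(words=["food", "good"], tags=[1])]   # list of one int

try:
    collect_tags(bad)
    bad_error = None
except TypeError as exc:
    bad_error = str(exc)

print(bad_error)
print(collect_tags(good))
```

The container type is correct in both cases, which is why checking type(x) on each element reveals nothing; the int is the tags field inside each document.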

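I'm not sure what is going wrong with your second attempt, but have you looked at a representative instance of train_tagged, for example train_tagged[0], to make sure it is:

  • a single TaggedDocument
  • with a tags value that is a list
  • where each item in that list is a simple string (or, in advanced usage, an int from a range starting at 0)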
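The checklist above can be sketched as a small function. This is only an illustration: `describe_problems` is a hypothetical helper, the namedtuple is a stand-in for gensim's TaggedDocument so the snippet runs without gensim, and the check on `words` is an addition that is not in the list above. It is included because the second traceback fails inside the loop over document.words, which suggests that wrapping get_words(...) in an extra list left a list where a single word string was expected.

```python
import numbers
from collections import namedtuple

# Stand-in (assumption) for gensim.models.doc2vec.TaggedDocument.
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def describe_problems(doc):
    """Hypothetical helper: list the ways doc deviates from the checklist."""
    if not isinstance(doc, TaggedDocument):
        return ["not a TaggedDocument"]
    problems = []
    if not isinstance(doc.tags, list):
        problems.append("tags is not a list")
    elif not all(isinstance(t, (str, numbers.Integral)) for t in doc.tags):
        problems.append("tags contains an item that is neither str nor int")
    # Extra check, not in the checklist: every word should be a plain string.
    if not isinstance(doc.words, list):
        problems.append("words is not a list")
    elif not all(isinstance(w, str) for w in doc.words):
        problems.append("words contains a non-string item")
    return problems


first_attempt = TaggedDocument(words=["food", "good"], tags=1)
second_attempt = TaggedDocument(words=[["food", "good"]], tags=[1])
fixed = TaggedDocument(words=["food", "good"], tags=[1])

for name, doc in [("first", first_attempt), ("second", second_attempt), ("fixed", fixed)]:
    print(name, describe_problems(doc))
```

Running something like this against train_tagged[0] (or train_tagged.iloc[0] for a pandas Series with a shuffled index) should point at whichever field has the wrong shape.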

Also note that if train_tagged is a proper sequence of TaggedDocument instances, you can and should pass it directly to build_vocab(). (There is no need for the odd [x for x in train_tagged.values] construct.)

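More generally, if you are just starting out with Doc2Vec, you will do better starting from the simple examples in the Gensim documentation than from anything on "Towards Data Science". There is a lot of very bad code and misguided practice on "Towards Data Science".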

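Another answer:

You are not passing any documents to your actual trainer; see the part with

model_dbow = Doc2Vec(dm=0,[...])

The 0 is interpreted as an integer, which is why you get the error. Instead, simply add your documents as detailed in the gensim docs for Doc2Vec and you should be fine.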
