Doc2Vec build_vocab method fails

I am following this guide to build a gensim doc2vec model.

I created an MRE to highlight the problem:

import pandas as pd, numpy as np, warnings, nltk, string, re, gensim
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

def get_words(para):
    pattern = r'([\d]|[\d][\d])\/([\d]|[\d][\d]\/([\d]{4}))'
    stop_words = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    no_dates = [re.sub(pattern, '', i) for i in para.lower().split()]
    no_punctuation = [nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in no_dates]
    stemmed_tokens = [stemmer.stem(word) for word in no_punctuation if word.strip() and len(word) > 1 and word not in stop_words]

    return stemmed_tokens

data_dict = {'ID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10}, 'Review': {0: "Even though the restauraunt was gross, the food was still good and I'd recommend it", 1: 'My waiter was awful, my food was awful, I hate it all', 2: 'I did not enjoy the food very much but I thought the waitstaff was fantastic', 3: 'Even though the cleanliness level was fantastic, my food was awful', 4: 'Everything was mediocre, but I guess mediocre is better than bad Nowadays', 5: "Honestly there wasn't a single thing that was mediocre about this place", 6: 'I Could not have enjoyed it more! Perfect', 7: 'This place is perfectly awful. I think it should shut down to be honest', 8: "I can't understand how anyone would say something negative", 9: "It killed me. I'm writing this review as a ghost. That's how bad it was."}, 'Bogus Field 1': {0: 'foo71', 1: 'foo92', 2: 'foo25', 3: 'foo88', 4: 'foo54', 5: 'foo10', 6: 'foo48', 7: 'foo76', 8: 'foo4', 9: 'foo11'}, 'Bogus Field 2': {0: 'foo12', 1: 'foo66', 2: 'foo94', 3: 'foo90', 4: 'foo97', 5: 'foo87', 6: 'foo10', 7: 'foo4', 8: 'foo16', 9: 'foo86'}, 'Sentiment': {0: 1, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 0, 8: 1, 9: 0}}

df = pd.DataFrame(data_dict, columns=data_dict.keys())
train, test = train_test_split(df, test_size=0.3, random_state=8)
train_tagged = train.apply(lambda x: TaggedDocument(words=get_words(x['Review']), tags=x['Sentiment']), axis=1)

model_dbow = Doc2Vec(dm=0, vector_size=50, negative=5, hs=0, min_count=1, sample=0, workers=8)
model_dbow.build_vocab([x for x in train_tagged.values])

This produces:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-590096b99bf9> in <module>
----> 1 model_dbow.build_vocab([x for x in train_tagged.values])

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self,documents,corpus_file,update,progress_per,keep_raw_vocab,trim_rule,**kwargs)
    926         total_words,corpus_count = self.vocabulary.scan_vocab(
    927             documents=documents,corpus_file=corpus_file,docvecs=self.docvecs,
--> 928             progress_per=progress_per,trim_rule=trim_rule
    929         )
    930         self.corpus_count = corpus_count

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self,docvecs,trim_rule)
   1123             documents = TaggedLineDocument(corpus_file)
   1124 
-> 1125         total_words,corpus_count = self._scan_vocab(documents,trim_rule)
   1126 
   1127         logger.info(

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in _scan_vocab(self,trim_rule)
   1069             document_length = len(document.words)
   1070 
-> 1071             for tag in document.tags:
   1072                 _note_doctag(tag,document_length,docvecs)
   1073 

TypeError: 'int' object is not iterable

I don't understand where the int type is coming from, because print(set([type(x) for x in train_tagged])) yields: {<class 'gensim.models.doc2vec.TaggedDocument'>}

Note that other troubleshooting attempts, such as:

train_tagged = train.apply(lambda x: TaggedDocument(words=[get_words(x['Review'])], tags=[x['Sentiment']]), axis=1)

yield:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-7bd5804d8d95> in <module>
----> 1 model_dbow.build_vocab(train_tagged)

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self,trim_rule)
   1073 
   1074             for word in document.words:
-> 1075                 vocab[word] += 1
   1076             total_words += len(document.words)
   1077 

TypeError: unhashable type: 'list'

Solution

Your first attempt is definitely putting a single value where TaggedDocument instances need a list of values, even if it is just a list of one value.
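To see where the int comes from even though every element is a TaggedDocument, the loop from the traceback (for tag in document.tags) can be mimicked without gensim. This is only a minimal sketch: the namedtuple is a stand-in with the same words and tags fields as gensim's TaggedDocument, and count_tags is a hypothetical helper written for illustration, not gensim code.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (assumption:
# used here only so the sketch runs without gensim installed).
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def count_tags(documents):
    """Hypothetical helper: iterate over each document's tags, as the traceback's loop does."""
    counts = {}
    for document in documents:
        for tag in document.tags:  # raises TypeError when tags is a bare int
            counts[tag] = counts.get(tag, 0) + 1
    return counts


bad = [TaggedDocument(words=["food", "good"], tags=1)]     # tags is an int
good = [TaggedDocument(words=["food", "good"], tags=[1])]  # tags is a list of one value

try:
    count_tags(bad)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
print(count_tags(good))
```

The type check in the question passes because each element really is a TaggedDocument; it is the tags field inside it that holds the int.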

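I'm not sure what is wrong with your second attempt, but have you looked at a representative instance of train_tagged, for example train_tagged[0], to make sure it is:

a single TaggedDocument
with a tags value that is a list
where each item in that list is a simple string (or, in advanced usage, an int from a range starting at 0)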
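The checks listed above can be sketched as a small helper. Everything here is an assumption made for illustration: find_problems is a hypothetical function, the namedtuple is a stand-in for gensim's TaggedDocument so the sketch runs without gensim, and the extra check on words is not in the list above. It was added because the second traceback fails at vocab[word] += 1 with an unhashable list, which is what would happen if words=[get_words(...)] wrapped the token list in another list.

```python
from collections import namedtuple
from numbers import Integral

# Stand-in with the same two fields as gensim's TaggedDocument (assumption).
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def find_problems(doc):
    """Hypothetical helper: return a list of shape problems found in one document."""
    problems = []
    if not isinstance(doc.tags, list):
        problems.append("tags is not a list")
    elif not all(isinstance(tag, (str, Integral)) for tag in doc.tags):
        problems.append("tags contains an item that is neither str nor int")
    if not isinstance(doc.words, list):
        problems.append("words is not a list")
    elif not all(isinstance(word, str) for word in doc.words):
        problems.append("words contains a non-string item")
    return problems


tokens = ["food", "good"]
first_attempt = TaggedDocument(words=tokens, tags=1)       # bare int as tags
second_attempt = TaggedDocument(words=[tokens], tags=[1])  # token list wrapped in a list
fixed = TaggedDocument(words=tokens, tags=[1])             # flat words, list of tags

for name, doc in [("first", first_attempt), ("second", second_attempt), ("fixed", fixed)]:
    print(name, find_problems(doc))
```

Integral is used rather than int so that NumPy integer values, such as those that can come out of a pandas row, are also accepted.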

Also note that if train_tagged is a proper sequence of TaggedDocument instances, you can and should pass it directly to build_vocab(). (There is no need for the odd [x for x in train_tagged.values] construct.)

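More generally, if you are just starting out with Doc2Vec, you will do better to start from the simple examples in the Gensim documentation than from anything on "Towards Data Science". There is a lot of really bad code and misguided practice on "Towards Data Science".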

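You are not passing any documents to your actual trainer; see the part with

model_dbow = Doc2Vec(dm=0,[...])

This 0 is interpreted as an integer, which is why you are getting the error. Instead, you should simply add your documents, as detailed in the gensim docs for Doc2Vec, and you should be fine.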
