Python CountVectorizer: empty vocabulary

How to fix the "empty vocabulary" error from CountVectorizer in Python

I am new to NLP and have been trying to use n-gram vectorization. However, I get this error: ValueError: empty vocabulary; perhaps the documents only contain stop words

My code is as follows:

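from sklearn.feature_extraction.text import CountVectorizer

for index, row in df_valid.iterrows():
  # Take this row's profile text and wrap it in a one-document list
  txt1 = str(row['profile'])
  print(txt1)
  txt1 = [txt1]

  # Word-level bigrams only, no stop-word filtering
  vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2, 2), analyzer='word')

  X1 = vectorizer.fit_transform(txt1)
  # get_feature_names_out() requires scikit-learn 1.0 or later
  features = vectorizer.get_feature_names_out()
  print("\n\nFeatures : \n", features)
  print("\n\nX1 : \n", X1.toarray())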

I tried adding stop words, but that did not work either. Please help. Thanks in advance.

Solution

First, try setting the parameter min_df=1 in CountVectorizer().

Second, try splitting the text on whitespace, since each iteration passes only a single item in the list. :)
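One possible reading of the second point: each iteration fits the vectorizer on a single document, and with ngram_range=(2, 2) a document that has fewer than two tokens contains no bigrams at all, which would leave the vocabulary empty. Below is a minimal sketch of that situation and of one way around it, fitting once on all documents instead of once per row. The `profiles` list is a made-up stand-in for `df_valid['profile']`, and `make_vectorizer` is a hypothetical helper, not part of the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for df_valid['profile']; the second entry has one token.
profiles = ["data scientist in london", "engineer", "machine learning engineer"]


def make_vectorizer():
    # Same settings as in the question: word-level bigrams only.
    return CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2))


# A one-token document has no bigrams, so fitting on it alone raises ValueError.
try:
    make_vectorizer().fit_transform(["engineer"])
    single_doc_error = None
except ValueError as exc:
    single_doc_error = str(exc)
print(single_doc_error)

# Fitting once on all documents builds a shared vocabulary; the one-token
# document simply becomes an all-zero row instead of raising an error.
vectorizer = make_vectorizer()
X = vectorizer.fit_transform(profiles)
features = sorted(vectorizer.vocabulary_)  # bigrams in column order
print(features)
print(X.toarray())
```

If per-row vectors are still needed, they can be read from the rows of `X` rather than refitting inside the loop. Another option is to skip rows whose text has fewer than two tokens before calling fit_transform.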