
Text classification with Multinomial Naive Bayes using multiple predictors


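My dataset contains only categorical variables. I have had little trouble using one categorical column to predict another column of the DataFrame, but I am finding it hard to understand how to predict with multiple columns/predictors.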

Suppose my dataset looks like this:

ItemCode  ItemDescription  Kind_of_food
273       Snicker          Chocolate
230       Lay's Chips      Chips
274       KitKat           Chocolate
123       Gummy Bears      Candy
124       Oreo             Cookies
123       Gummy Bears      Candy
273       Snicker          Chocolate

... x 1,000,000 rows.

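When I use only ItemDescription to predict ItemCode, I first clean the dataset (removing stop words, apostrophes, and so on; that step is not shown below) and then run it through train_test_split.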

import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score
from nltk.stem.porter import PorterStemmer

# Stratified split: 100,000 training rows and 30,000 test rows
x_train, x_test, y_train, y_test = train_test_split(
    df['ItemDescription'], df['ItemCode'],
    train_size=100000, test_size=30000, stratify=df['ItemCode'])

stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    # Tokenize with the default analyzer, then stem each token
    return (stemmer.stem(w) for w in analyzer(doc))

# stopWords is defined earlier and not shown in the code.
# Note: because analyzer is a callable, scikit-learn ignores
# ngram_range, stop_words and tokenizer here.
vect = CountVectorizer(ngram_range=(2, 2), max_features=500,
                       stop_words=stopWords, analyzer=stemmed_words,
                       tokenizer=word_tokenize)
X_train = vect.fit_transform(x_train)
X_test = vect.transform(x_test)

multiNB = MultinomialNB(alpha=0.2)
multiNB.fit(X_train, y_train)
predicted = multiNB.predict(X_test)

print("accuracy of test model is: ", accuracy_score(y_test, predicted))

The code works with one predictor, but now suppose I want to add the Kind_of_food column by encoding it as dummy variables.

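dummies = pd.get_dummies(df.Kind_of_food)
df = pd.concat([df, dummies], axis='columns')
# Drop only the original column now that it is one-hot encoded.
# ItemCode (the target) and Cookies (selected in X below) must stay.
df = df.drop(['Kind_of_food'], axis=1)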
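To make the dummy-encoding step concrete, here is a minimal sketch on a toy frame. The frame `toy` and the name `encoded` are made up for illustration and only mirror the sample rows shown earlier; they are not the real data.

```python
import pandas as pd

# Toy frame mirroring the sample rows above (hypothetical, not the real data).
toy = pd.DataFrame({
    "ItemCode": [273, 230, 274, 123, 124],
    "ItemDescription": ["Snicker", "Lay's Chips", "KitKat", "Gummy Bears", "Oreo"],
    "Kind_of_food": ["Chocolate", "Chips", "Chocolate", "Candy", "Cookies"],
})

# One indicator column per distinct category value.
dummies = pd.get_dummies(toy["Kind_of_food"])

# Append the indicator columns next to the original columns.
encoded = pd.concat([toy, dummies], axis="columns")

print(list(dummies.columns))
print(encoded.shape)
```

Each row has exactly one indicator set, so the dummy columns carry the same information as Kind_of_food in a numeric form that MultinomialNB can consume.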

I then create a new variable,

X = df[['ItemDescription','Cookies','Chips','Candy','Chocolate']] 

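and change the train_test_split call from:

x_train, x_test, y_train, y_test = train_test_split(df['ItemDescription'], df['ItemCode'], train_size=100000, test_size=30000, stratify=df['ItemCode'])

to:

x_train, x_test, y_train, y_test = train_test_split(X, Y, stratify=Y)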

When I then run the same code, I get:

Found input variables with inconsistent numbers of samples: [3, 100000]
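A likely cause, assuming the multi-column x_train is still passed straight to vect.fit_transform as in the single-predictor code: iterating over a pandas DataFrame yields its column labels rather than its rows, so CountVectorizer sees one "document" per column. The sketch below reproduces that on a small made-up frame (`toy_X` is hypothetical).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-column feature frame with four rows.
toy_X = pd.DataFrame({
    "ItemDescription": ["Snicker", "Lay's Chips", "KitKat", "Gummy Bears"],
    "Chips": [0, 1, 0, 0],
    "Candy": [0, 0, 0, 1],
})

# Passing the whole frame: the vectorizer iterates over column labels,
# so it produces one row per column.
as_frame = CountVectorizer().fit_transform(toy_X)

# Passing a single text column: the vectorizer iterates over the values,
# so it produces one row per sample.
as_series = CountVectorizer().fit_transform(toy_X["ItemDescription"])

print(as_frame.shape[0])   # equals the number of columns in toy_X
print(as_series.shape[0])  # equals the number of rows in toy_X
```

With three columns, the vectorized matrix has 3 rows while the target still has one entry per sample, which matches the shape mismatch reported in the error.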

The code breaks at the multiNB.fit line when it tries to fit x_train (100000, 3) and y_train (100000). How should I adjust the code so that it runs?
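One possible way to combine a text column with a categorical column is scikit-learn's ColumnTransformer, which applies a different transformer to each column and stacks the results side by side. The following is only a sketch under assumptions: the toy frame is made up, OneHotEncoder is swapped in for pd.get_dummies so the encoding happens inside the pipeline, and the stemming analyzer from the original code is left out for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data shaped like the sample rows in the question.
toy = pd.DataFrame({
    "ItemCode": [273, 230, 274, 123, 124, 123, 273, 230],
    "ItemDescription": ["Snicker", "Lay's Chips", "KitKat", "Gummy Bears",
                        "Oreo", "Gummy Bears", "Snicker", "Lay's Chips"],
    "Kind_of_food": ["Chocolate", "Chips", "Chocolate", "Candy",
                     "Cookies", "Candy", "Chocolate", "Chips"],
})

features = ColumnTransformer([
    # A string selector passes a 1-D column, which CountVectorizer expects.
    ("text", CountVectorizer(), "ItemDescription"),
    # A list selector passes a 2-D frame, which OneHotEncoder expects.
    ("kind", OneHotEncoder(handle_unknown="ignore"), ["Kind_of_food"]),
])

model = Pipeline([
    ("features", features),
    ("nb", MultinomialNB(alpha=0.2)),
])

X = toy[["ItemDescription", "Kind_of_food"]]
y = toy["ItemCode"]

# Both feature blocks are non-negative counts, as MultinomialNB requires.
model.fit(X, y)
predicted = model.predict(X)
print(predicted)
```

Because the vectorizer and the encoder live inside the pipeline, the same fitted transformations are reused at prediction time, and X can be passed to train_test_split as an ordinary DataFrame.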
