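How to use multiple predictors for text classification with Multinomial Naive Bayes
My dataset contains only categorical variables. I have had little trouble using one categorical column to predict another, but I find it hard to understand how to predict using multiple columns/predictors.
Suppose my dataset looks like this: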
ItemCode  ItemDescription  Kind_of_food
273       Snicker          Chocolate
230       Lay's Chips      Chips
274       KitKat           Chocolate
123       Gummy Bears      Candy
124       Oreo             Cookies
123       Gummy Bears      Candy
273       Snicker          Chocolate
...       (x 1000000 rows)
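If I use only ItemDescription to predict ItemCode, I first clean the dataset (removing stop words, apostrophes, etc.; not shown below) and then run it through train_test_split.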
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score
from nltk.stem.porter import PorterStemmer

# Stratified split: descriptions are the predictor, item codes the target.
x_train, x_test, y_train, y_test = train_test_split(
    df['ItemDescription'], df['ItemCode'],
    train_size=100000, test_size=30000, stratify=df['ItemCode'])

stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    # Stem every token produced by the default analyzer.
    return (stemmer.stem(w) for w in analyzer(doc))

# stopWords is defined earlier and not shown in the code.
# Note: scikit-learn ignores ngram_range, stop_words and tokenizer
# when analyzer is a callable, as it is here.
vect = CountVectorizer(ngram_range=(2, 2), max_features=500, stop_words=stopWords,
                       analyzer=stemmed_words, tokenizer=word_tokenize)
X_train = vect.fit_transform(x_train)
X_test = vect.transform(x_test)
multiNB = MultinomialNB(alpha=0.2)
multiNB.fit(X_train, y_train)
predicted = multiNB.predict(X_test)
print("accuracy of test model is: ", accuracy_score(y_test, predicted))
This code works with one predictor, but suppose I want to add the Kind_of_food column as dummy variables:
dummies = pd.get_dummies(df.Kind_of_food)
df = pd.concat([df,dummies],axis = 'columns')
df = df.drop(['ItemCode','Cookies'],axis = 1)
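To make the dummy-encoding step concrete, here is a minimal sketch on a toy frame that mirrors the sample table. The seven rows and the names `toy` and `encoded` are assumptions made for illustration, not part of the real dataset.

```python
import pandas as pd

# Toy rows copied from the sample table (illustrative only).
toy = pd.DataFrame({
    "ItemCode": [273, 230, 274, 123, 124, 123, 273],
    "ItemDescription": ["Snicker", "Lay's Chips", "KitKat", "Gummy Bears",
                        "Oreo", "Gummy Bears", "Snicker"],
    "Kind_of_food": ["Chocolate", "Chips", "Chocolate", "Candy",
                     "Cookies", "Candy", "Chocolate"],
})

# One indicator column per distinct Kind_of_food value.
dummies = pd.get_dummies(toy["Kind_of_food"])

# Append the indicator columns to the original frame.
encoded = pd.concat([toy, dummies], axis="columns")
print(encoded.columns.tolist())
```

Each row gets exactly one active indicator, so the four new columns carry the same information as the single Kind_of_food column.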
Then I create a new variable,
X = df[['ItemDescription','Cookies','Chips','Candy','Chocolate']]
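and change the train_test_split call from:
x_train, x_test, y_train, y_test = train_test_split(df['ItemDescription'], df['ItemCode'], train_size=100000, test_size=30000, stratify=df['ItemCode'])
to:
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=100000, test_size=30000, stratify=Y)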
I get
ValueError: Found input variables with inconsistent numbers of samples: [3, 100000]
when I try to run the same code.
The code breaks at the multiNB.fit line when fitting x_train (100000, 3) against y_train (100000). How should I adjust the code to proceed?
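A likely cause of the mismatch: CountVectorizer iterates over whatever it is given, and iterating over a DataFrame yields its column labels rather than its rows, so a three-column x_train is vectorized into three "documents" while y_train still has 100000 entries. One possible way around this, sketched below under assumptions, is to route each column to its own transformer with ColumnTransformer and feed the combined matrix to MultinomialNB. The toy rows, the default vectorizer settings, and the names `features`, `model` and `n_docs_seen` are illustrative choices, not the original setup.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows copied from the sample table (illustrative only).
df = pd.DataFrame({
    "ItemCode": [273, 230, 274, 123, 124, 123, 273],
    "ItemDescription": ["Snicker", "Lay's Chips", "KitKat", "Gummy Bears",
                        "Oreo", "Gummy Bears", "Snicker"],
    "Kind_of_food": ["Chocolate", "Chips", "Chocolate", "Candy",
                     "Cookies", "Candy", "Chocolate"],
})
X = df[["ItemDescription", "Kind_of_food"]]
y = df["ItemCode"]

# Reproduce the symptom: the vectorizer sees one "document" per column.
n_docs_seen = CountVectorizer().fit_transform(X).shape[0]
print(n_docs_seen, "documents seen for", len(X), "rows")

features = ColumnTransformer([
    # A string selector passes a 1-D column of text to CountVectorizer.
    ("text", CountVectorizer(), "ItemDescription"),
    # A list selector passes a 2-D frame to OneHotEncoder.
    ("kind", OneHotEncoder(handle_unknown="ignore"), ["Kind_of_food"]),
])

# Word counts and one-hot indicators are both non-negative,
# which is what MultinomialNB expects.
model = Pipeline([("features", features), ("nb", MultinomialNB(alpha=0.2))])
model.fit(X, y)
predicted = model.predict(X)
```

With the real data, the same pipeline would be fitted on an x_train that keeps both raw columns (no manual get_dummies step) and evaluated with model.predict(x_test). OneHotEncoder does the dummy encoding inside the pipeline, so the train and test sets are guaranteed the same indicator columns.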