How to fix TfidfVectorizer shrinking my DataFrame from 799 rows to 3
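I have a DataFrame with a text column and multi-label values:
RepID  RepText                                  Code
1      This is a test. Thanks for buying...     fruit,meat
2      Bought milk, bananas, I also bought...   dairy,fruit,other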
Here is my code:
######## df has 1000 records
multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(df['Code'])
y = multilabel_binarizer.transform(df['Code'])
X = df[df.columns.difference(["Code"])]
######## df split into X (RepID,RepText)
######## and y (Code)
xtrain,xval,ytrain,yval = train_test_split(X,y,test_size=0.2,random_state=9)
##### xtrain.shape = (800,3)
##### xval.shape = (200,3)
##### ytrain.shape = (800,1725)
##### yval.shape = (200,1725)
tfidf_vectorizer = TfidfVectorizer(max_df=0.8,max_features=10000)
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)
##### But after the code above
##### xtrain_tfidf.shape = (3,3)
##### xval_tfidf.shape = (3,1725)
##### which means that when I run the next lines
xval_tfidf.shape
#mdl = LinearRegression()
mdl = LogisticRegression()
#mdl = SVC(gamma='auto',probability=True)
clf = OneVsRestClassifier(mdl)
clf.fit(xtrain_tfidf,ytrain)
I get this error:
ValueError: Found input variables with inconsistent numbers of samples: [3,799]
Why? Why do I get only 3 records after the TfidfVectorizer line instead of 800?
When I try to inspect xtrain_tfidf, I get this:
xtrain_tfidf
Out[56]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
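Solution
I found the cause.
I forgot to select only the text column when splitting the records. Iterating over a DataFrame yields its column names, not its rows, so TfidfVectorizer treated the 3 column names as 3 documents, which is why the result had 3 rows.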
xtrain,xval,ytrain,yval = train_test_split(X["RepText"],y,test_size=0.2,random_state=9)
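To see why this happens, here is a minimal sketch on a toy DataFrame. The rows and the third column name ("Other") are made up for illustration and are not from the original data. It compares passing the whole DataFrame to TfidfVectorizer with passing only the text column.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the real data (hypothetical rows and a hypothetical
# third column "Other"; the original dataset is not reproduced here)
df = pd.DataFrame({
    "RepID": [1, 2, 3, 4],
    "RepText": [
        "bought milk and bananas",
        "thanks for buying fruit",
        "bought meat and milk",
        "fruit and bananas again",
    ],
    "Other": ["a", "b", "c", "d"],
})

# Iterating over a DataFrame yields its column labels, not its rows
docs_seen = list(df)

# Whole DataFrame: the vectorizer sees one "document" per column name
wrong = TfidfVectorizer().fit_transform(df)

# Text column only: the vectorizer sees one document per row
right = TfidfVectorizer().fit_transform(df["RepText"])

print(docs_seen)    # the column labels
print(wrong.shape)  # rows == number of columns
print(right.shape)  # rows == number of records
```

With the whole DataFrame the matrix has one row per column, which matches the 3x3 result in the question. With df["RepText"] it has one row per record, so the sample count lines up with ytrain again.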