
How can I fix a data type problem in a text classification task?

I want to build a deep learning classifier to predict Kickstarter campaign success. Something goes wrong in the model part, and I can't figure out how to fix it.

My code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from keras.models import Sequential
from keras import layers


df = pd.read_csv('../input/kickstarter-campaigns-dataset/kickstarter_data_full.csv')

df_X = [] # for x class
df_y = [] # for labels

for i in range(len(df)):
    tmp = str(df['blurb'][i]) + " " + str(df['goal'][i]) + " " + str(df['pledged'][i]) + " " + str(df['country'][i]) + " " + str(df['currency'][i]) + " " + str(df['category'][i]) + " " + str(df['spotlight'][i])  
    df_X.append(tmp)
    df_y.append(str(df['SuccessfulBool'][i]))

X_train,X_test,y_train,y_test = train_test_split(df_X,df_y,test_size=0.25,random_state=1000)
vectorizer = CountVectorizer()
vectorizer.fit(X_train)

X_train = vectorizer.transform(X_train)
X_test  = vectorizer.transform(X_test)

input_dim = X_train.shape[1]

model = Sequential()
model.add(layers.Dense(10,input_dim=input_dim,activation='relu'))
model.add(layers.Dense(1,activation='sigmoid'))

model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

history = model.fit(X_train,y_train,epochs=100,verbose=False,validation_data=(X_test,y_test),batch_size=10)

At this point I get: ValueError: Failed to find data adapter that can handle input: <class 'scipy.sparse.csr.csr_matrix'>, (<class 'list'> containing values of types {"<class 'str'>"})

I tried to fix it with np.asarray:

X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
X_test = np.asarray(X_test)
y_test = np.asarray(y_test)

and got this ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type csr_matrix).
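For what it's worth, a common way around this particular error (assuming the vectorized matrix fits in memory) is to densify the scipy sparse matrix before handing it to Keras, using `.toarray()`. A minimal sketch with a toy matrix standing in for the CountVectorizer output:

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy stand-in for the CountVectorizer output
X_sparse = csr_matrix(np.array([[0, 1, 2], [3, 0, 0]], dtype=np.float32))

# Keras cannot consume a csr_matrix directly, but it can consume ndarrays
X_dense = X_sparse.toarray()
```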

So I used this instead:

np.asarray(X_train).astype(np.float32)
np.asarray(y_train).astype(np.float32)
np.asarray(X_test).astype(np.float32)
np.asarray(y_test).astype(np.float32)

but then I get ValueError: setting an array element with a sequence.
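That message is NumPy's generic complaint when asked to build a numeric array out of elements that are not plain numbers: here the "element" is the whole csr_matrix object, which np.asarray merely wraps rather than converts. The same message can be reproduced with made-up ragged data:

```python
import numpy as np

# NumPy can only build float arrays from rectangular numeric data;
# a ragged nested list triggers the same "sequence" ValueError
try:
    np.asarray([[1, 2], [3]], dtype=np.float32)
    err = ""
except ValueError as e:
    err = str(e)
```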

I also tried this:

X_train = np.expand_dims(X_train,-1)
y_train   = np.expand_dims(y_train,-1)
X_test = np.expand_dims(X_test,-1)
y_test   = np.expand_dims(y_test,-1)

but at the model.fit step I keep getting the same ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type csr_matrix).

I'm learning with the Kickstarter campaigns dataset on Kaggle: https://www.kaggle.com/sripaadsrinivasan/kickstarter-campaigns-dataset

I don't have much NLP background. I have searched and tried to fix this myself, but couldn't. This is my homework. Can you help me solve it?

df_X and df_y are the same length.
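Note that df_y collects the labels as strings. Keras expects numeric labels for binary_crossentropy, and digit strings can be converted in one step, as the answer below does with `[:, None].astype(int)`. A small sketch with toy labels:

```python
import numpy as np

df_y = ["1", "0", "1", "0"]              # labels collected as strings
y = np.array(df_y)[:, None].astype(int)  # shape (n, 1), integer 0/1
```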

Solution

You need to add an Embedding layer on top of the network to vectorize the words, like this:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers


df = pd.read_csv('../input/kickstarter-campaigns-dataset/kickstarter_data_full.csv')

df_X = [] # for x class
df_y = [] # for labels

for i in range(len(df)):
    tmp = str(df['blurb'][i]) + " " + str(df['goal'][i]) + " " + str(df['pledged'][i]) + " " + str(df['country'][i]) + " " + str(df['currency'][i]) + " " + str(df['category'][i]) + " " + str(df['spotlight'][i])  
    df_X.append(tmp)
    df_y.append(str(df['SuccessfulBool'][i]))

vocab_size = 1000
encoded_docs = [one_hot(d,vocab_size) for d in df_X]
max_length = 20
padded_docs = pad_sequences(encoded_docs,maxlen=max_length,padding='post')
X_train,X_test,y_train,y_test = train_test_split(padded_docs,np.array(df_y)[:,None].astype(int),test_size=0.25,random_state=1000)
model = Sequential()
model.add(layers.Embedding(vocab_size,100,input_length=max_length))
model.add(layers.Flatten())
model.add(layers.Dense(10,activation='relu'))
model.add(layers.Dense(1,activation='sigmoid'))

model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model.summary())
model.fit(X_train,y_train,epochs=50,verbose=1,validation_data=(X_test,y_test),batch_size=10)
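To make the preprocessing concrete: keras' one_hot hashes each word to an integer in [1, vocab_size), and pad_sequences pads (or truncates) every sequence to a fixed length. A rough pure-Python stand-in for the same idea (the actual keras hashing function differs):

```python
def hash_encode(text, vocab_size):
    # map each word to an integer in [1, vocab_size), like keras one_hot
    return [(hash(w) % (vocab_size - 1)) + 1 for w in text.lower().split()]

def pad_post(seq, maxlen):
    # zero-pad at the end (padding='post'), truncate if too long
    return (seq + [0] * maxlen)[:maxlen]

doc = "funding goal reached"
encoded = hash_encode(doc, 1000)   # 3 words -> 3 integer ids
padded = pad_post(encoded, 20)     # fixed length of 20
```

The Embedding layer then maps each of those integer ids to a learned 100-dimensional vector, which is what lets the dense layers work on text.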
