
Text classification with a neural network in Keras - the model is weak

I am trying to classify verses from the Bible by book. The problem is that my model is weak and I can't find a way to improve it.

Here is my code:

import tensorflow.keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout,Activation
from tensorflow.keras.layers import MaxPooling2D,Conv2D
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint,EarlyStopping
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import SpatialDropout1D
from sklearn.model_selection import train_test_split
from tensorflow.keras import regularizers

import pandas as pd              
import numpy as np 

data = pd.read_csv("bible_data_set (with count and testament).csv")
data


import nltk
from nltk.stem import PorterStemmer

nltk.download('punkt')  # word_tokenize needs the punkt tokenizer models

ps = PorterStemmer()

vocabulary_size = 0
word2location = {}

def prepare_vocabulary(data):
    """Assign a unique index to every stemmed word in the corpus."""
    index = 0
    for sentence in data['text']:
        #sentence = sentence.lower()
        words = nltk.word_tokenize(sentence)
        for word in words:
            stemmed_word = ps.stem(word)
            if stemmed_word not in word2location:
                word2location[stemmed_word] = index
                index += 1
    return index

def convert2vec(sentence):
    """Turn a sentence into a bag-of-words count vector."""
    #sentence = sentence.lower()
    res_vec = np.zeros(vocabulary_size)
    words = nltk.word_tokenize(sentence)
    for word in words:
        stemmed_word = ps.stem(word)
        if stemmed_word in word2location:
            res_vec[word2location[stemmed_word]] += 1
    return res_vec

books = ['Genesis','Exodus','Leviticus','Numbers','Deuteronomy','Joshua','Judges','Ruth','1 Samuel','2 Samuel','1 Kings','2 Kings','1 Chronicles','2 Chronicles','Ezra','Nehemiah','Esther','Job','Psalms','Proverbs','Ecclesiastes','Song of Solomon','Isaiah','Jeremiah','Lamentations','Ezekiel','Daniel','Hosea','Joel','Amos','Obadiah','Jonah','Micah','Nahum','Habakkuk','Zephaniah','Haggai','Zechariah','Malachi','Matthew','Mark','Luke','John','Acts','Romans','1 Corinthians','2 Corinthians','galatians','Ephesians','Philippians','Colossians','1 Thessalonians','2 Thessalonians','1 Timothy','2 Timothy','Titus','Philemon','Hebrews','James','1 Peter','2 Peter','1 John','2 John','3 John','Jude','Revelation']

def encode(line):
    """One-hot encode the book of the verse at row `line` (66 books)."""
    res_vec = np.zeros(66)
    idx = books.index(data.iloc[line]['book'])
    res_vec[idx] = 1
    return res_vec

vocabulary_size = prepare_vocabulary(data)
print("the size of the vocabulary is: ",vocabulary_size)
word2location


import random

rand = []
for r in range (4500):
    ra = random.randrange(0,31101)
    if(ra not in rand):
        rand.append(ra)
            
train_x = []
train_y = []
test_x = []
test_y = []
for i in range(len(data['text'])):
    if(i not in rand):
        train_x.append(i)
        train_y.append(i)
        
    elif(i in rand):
        test_x.append(i)
        test_y.append(i)
data_x = np.array([convert2vec(data.iloc[i]['text']) for i in train_x])
np.random.shuffle(data_x)
data_y = np.array([encode(i) for i in train_y])
np.random.shuffle(data_y)
test_data_x = np.array([convert2vec(data.iloc[i]['text']) for i in test_x])
np.random.shuffle(test_data_x)
test_data_y = np.array([encode(i) for i in test_y])
np.random.shuffle(test_data_y)

model = Sequential()
model.add(Dense(128, activation='sigmoid', input_dim=vocabulary_size))
model.add(Dropout(0.1))
model.add(Dense(128, activation='sigmoid'))
model.add(Dropout(0.1))
model.add(Dense(66, activation='softmax'))

opt = SGD(learning_rate=0.01)

model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
history = model.fit(data_x, data_y,
                    epochs=50,
                    batch_size=16,
                    validation_data=(test_data_x, test_data_y),
                    callbacks=[EarlyStopping(monitor='val_loss', patience=5, min_delta=0.00001)])


I keep either overfitting or underfitting. I have already tried relu activations for the Dense layers and changed the loss function and the optimizer, but nothing helped. Is there something I'm missing?

Solution

Here:

data_x = np.array([convert2vec(data.iloc[i]['text']) for i in train_x])
np.random.shuffle(data_x)
data_y = np.array([encode(i) for i in train_y])
np.random.shuffle(data_y)
test_data_x = np.array([convert2vec(data.iloc[i]['text']) for i in test_x])
np.random.shuffle(test_data_x)
test_data_y = np.array([encode(i) for i in test_y])
np.random.shuffle(test_data_y)

You call np.random.shuffle on your training data (data_x) and then separately on your training labels (data_y). That cannot be right, because each feature vector has to stay paired with its label; shuffling the two arrays independently scrambles the correspondence between verses and books, so there is nothing left for the model to learn. Pair them up and shuffle them together with a single permutation, and do the same for the test set, as shown in the sketch below.
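
For example, here is a minimal sketch of a paired shuffle using np.random.permutation, reusing convert2vec, encode, and the index lists from the question (the variable names match the original code):

import numpy as np

# Build the arrays first, without shuffling anything yet.
data_x = np.array([convert2vec(data.iloc[i]['text']) for i in train_x])
data_y = np.array([encode(i) for i in train_y])

# One shared permutation keeps every feature vector aligned with its label.
perm = np.random.permutation(len(data_x))
data_x, data_y = data_x[perm], data_y[perm]

# Do exactly the same for the test set.
test_data_x = np.array([convert2vec(data.iloc[i]['text']) for i in test_x])
test_data_y = np.array([encode(i) for i in test_y])
perm = np.random.permutation(len(test_data_x))
test_data_x, test_data_y = test_data_x[perm], test_data_y[perm]

Alternatively, train_test_split from scikit-learn (already imported at the top of your code) performs a paired, shuffled split in a single call, which would replace the hand-rolled random index sampling entirely.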
