How to use a saved text classification model to make predictions on a new text dataset

I trained a text classifier by following this guide: https://developers.google.com/machine-learning/guides/text-classification/step-4

and saved the trained model to a file.

Given that, how can I use this model to classify the text in another, new dataset?

Thanks

Solution

import tensorflow as tf

# Recreate the exact same model, including its weights and the optimizer state
new_model = tf.keras.models.load_model('~./output/model.h5')

# Show the model architecture
new_model.summary()

# Prepare the new data with the same preprocessing steps used during training.
# Let's say the preprocessed data is stored in test_data and its labels in test_labels.

# Check the model's accuracy on the unseen/new dataset
loss, acc = new_model.evaluate(test_data, test_labels, verbose=2)
print('Restored model, accuracy: {:5.2f}%'.format(100 * acc))
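Note that evaluate() needs labels. If the new dataset is unlabeled, you would call new_model.predict(new_data) instead and turn the returned probabilities into class labels. Below is a minimal sketch of that last step. It assumes the final layer is either a single sigmoid unit (two classes) or a softmax over all classes; the helper name probabilities_to_labels, the class names, and the hard-coded probabilities are made up for illustration and stand in for real predict() output.

```python
def probabilities_to_labels(probs, class_names, threshold=0.5):
    """Map rows of predicted probabilities to class names.

    probs is expected to look like the output of model.predict():
    one row per sample, with either 1 value (sigmoid) or one value
    per class (softmax).
    """
    labels = []
    for row in probs:
        row = list(row)
        if len(row) == 1:
            # Single sigmoid unit: probability of the second class
            labels.append(class_names[int(row[0] >= threshold)])
        else:
            # Softmax: pick the class with the highest probability
            best = max(range(len(row)), key=row.__getitem__)
            labels.append(class_names[best])
    return labels


# Hypothetical stand-ins for new_model.predict(new_data)
binary_probs = [[0.91], [0.08]]
multi_probs = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]

binary_labels = probabilities_to_labels(binary_probs, ["negative", "positive"])
multi_labels = probabilities_to_labels(multi_probs, ["sports", "tech", "politics"])
print(binary_labels)
print(multi_labels)
```

The new texts still have to go through the same vectorization that was fitted on the training data before they are passed to predict().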

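You can use TensorFlow's text tokenization utility class (Tokenizer) to handle unknown words in the test data.

num_words is the vocabulary size (the most frequent words are kept).
Set oov_token='some string'; it is used for every token/word that falls outside the vocabulary size (basically, new words in the test data are treated as the oov_token string).
Fit the tokenizer on the training data, then generate token sequences for both the training and the test data.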

tf.keras.preprocessing.text.Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0, **kwargs)
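To make the out-of-vocabulary idea concrete, here is a small pure-Python sketch of the three points above. It is a simplified stand-in for illustration, not the Keras implementation: the function names fit_vocab and to_sequences, the "<OOV>" string, and the sample sentences are all assumptions, and punctuation filtering is left out.

```python
from collections import Counter

OOV_TOKEN = "<OOV>"  # hypothetical placeholder string


def fit_vocab(train_texts, oov_token=OOV_TOKEN):
    """Build a word -> index map from the training texts only."""
    counts = Counter(
        word for text in train_texts for word in text.lower().split()
    )
    word_index = {oov_token: 1}  # index 0 is left free for padding
    # More frequent words get smaller indices
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequences(texts, word_index, num_words=None, oov_token=OOV_TOKEN):
    """Convert texts to index sequences, mapping unknown words to the OOV index."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            # Unseen word, or a word ranked outside the num_words limit
            if idx is None or (num_words is not None and idx >= num_words):
                idx = oov_id
            seq.append(idx)
        sequences.append(seq)
    return sequences


train_texts = ["the movie was great", "the movie was bad"]
test_texts = ["the film was great"]  # "film" never appears in training

word_index = fit_vocab(train_texts)           # fit on training data only
train_seqs = to_sequences(train_texts, word_index)
test_seqs = to_sequences(test_texts, word_index)
print(test_seqs)
```

The important design point is that the vocabulary is built once, from the training data, and reused unchanged for the new dataset, so the saved model sees the same word indices it was trained on.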