How to resolve an error when using Keras

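I have a dataset with many categorical features and many other features. I want to use embedding layers to convert the categorical data into numeric data for use by other models. However, I hit an error during training. Currently, my training process is: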

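Apply a label encoder to the categorical features.
Split the data into training and test sets with train_test_split().
Drop the numeric columns, so that only the categorical features and the target y are passed to model training.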

I got this error:

    indices[13,0] = 10 is not in [0,10)
     [[node functional_1/embed_6/embedding_lookup (defined at <ipython-input-34-0b6b3ae455d0>:4) ]] [Op:__inference_train_function_3509]

    Errors may have originated from an input operation.
    Input Source operations connected to node functional_1/embed_6/embedding_lookup:
     functional_1/embed_6/embedding_lookup/2395 (defined at /usr/lib/python3.6/contextlib.py:81)

    Function call stack:
    train_function

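After searching, I found people saying the problem is that the embedding layer's vocabulary size argument (input_dim in keras.layers.Embedding) is wrong, and that enlarging the vocabulary size fixes it. In my case, however, I need to map the results back to the original labels.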
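To make the error message concrete, here is a plain-Python sketch of what an embedding lookup does: it indexes rows of a table with `input_dim` rows, so valid indices are 0 through `input_dim - 1`. The helper names `make_embedding_table` and `embedding_lookup` are made up for this illustration and are not Keras functions.

```python
import random

def make_embedding_table(input_dim, output_dim, seed=0):
    """Build a hypothetical embedding table: input_dim rows of output_dim floats."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]

def embedding_lookup(table, indices):
    """Return one row per index; reject indices outside [0, input_dim)."""
    input_dim = len(table)
    for idx in indices:
        if not 0 <= idx < input_dim:
            raise IndexError(f"index {idx} is not in [0, {input_dim})")
    return [table[idx] for idx in indices]

table = make_embedding_table(input_dim=10, output_dim=1)

# Indices 0..9 are valid for a table with 10 rows.
valid = embedding_lookup(table, [0, 9])

# Index 10 would need an 11th row, which is the situation in the traceback.
try:
    embedding_lookup(table, [10])
    out_of_range_error = None
except IndexError as exc:
    out_of_range_error = str(exc)
```

Read this way, the traceback suggests that the feature feeding `embed_6` contains at least 11 distinct encoded values (0 through 10) while the layer was created with room for only 10.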

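For example, I have a categorical feature ['dog','cat','fish']. After label encoding it becomes [0,1,2]. An embedding layer for this feature, which has 3 unique values, should output something like ([-0.22748041],[-0.03832678],[-0.16490786]). I can then replace 'dog' in the original data with -0.22748041, 'cat' with -0.03832678, and so on. So I cannot change the vocabulary size, or the output dimension would be wrong.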
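The mapping back to the original labels could be sketched roughly as follows. The weight values are the illustrative numbers from the example above, hard-coded here; in a trained Keras model they would normally be read from the embedding layer (for instance via `get_weights()`), which this sketch does not do.

```python
# Label encoding as described in the example: dog -> 0, cat -> 1, fish -> 2.
labels = ["dog", "cat", "fish"]
label_to_id = {label: i for i, label in enumerate(labels)}

# Illustrative embedding matrix of shape (3, 1): one row per encoded id.
# These numbers are copied from the example, not produced by a real model.
embedding_weights = [[-0.22748041], [-0.03832678], [-0.16490786]]

# Row i of the matrix is the vector for the label whose id is i.
label_to_vector = {label: embedding_weights[i] for label, i in label_to_id.items()}

# Replace each category in a (hypothetical) raw column by its learned value.
column = ["dog", "fish", "dog", "cat"]
encoded_column = [label_to_vector[value][0] for value in column]
```

The key point is that the row order of the weight matrix follows the integer ids, so the label-to-id mapping has to be kept alongside the model.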

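I think my problem is that not every category makes it into training (for example, the training data contains only ['dog','fish'], and ['cat'] appears only in the test data). If I set the vocabulary size to 3, the error above is reported. If I add ['cat'] to the training data as an experiment, it works fine.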
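A small plain-Python sketch of this situation, using made-up rows: 'cat' occurs only in the test split, so a vocabulary derived from the training split alone has no id for it, whereas one derived from the full column does. Whether the original code built its vocabulary this way is an assumption; the sketch only illustrates how a category can go missing.

```python
# Hypothetical rows for one categorical column, already split.
train_values = ["dog", "fish", "dog", "fish"]
test_values = ["cat", "dog"]

# Vocabulary built from the training split only (assumed for illustration).
train_vocab = sorted(set(train_values))
train_ids = {label: i for i, label in enumerate(train_vocab)}

# Categories that the training-only vocabulary cannot encode.
unseen = sorted(set(test_values) - set(train_vocab))

# Vocabulary built from the full column covers every category.
full_vocab = sorted(set(train_values) | set(test_values))
full_ids = {label: i for i, label in enumerate(full_vocab)}
missing_after_full_fit = [v for v in test_values if v not in full_ids]
```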

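My question is: does the embedding layer have to see all the unique values during training in order to do what I want? If the categorical data has many unique values, how can I make sure that all of them appear in the training data when the data is split?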

Thanks in advance!

Solution

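When creating the lookup table, you need to use out-of-vocabulary (OOV) buckets. OOV buckets allow unknown categories to be looked up at test time.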

What does the solution do?

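Setting num_oov_buckets to some number (such as 1000) reserves that many extra ids beyond the vocabulary. A category that is not in the vocabulary, for example one that appears only in the test data, is hashed to one of these extra ids instead of causing the lookup to fail.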
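The idea behind OOV buckets can be sketched in plain Python. This is only an illustration of the mechanism: the helper `build_lookup` is hypothetical, and `zlib.crc32` stands in for whatever internal hash the real `tf.lookup.StaticVocabularyTable` uses, so the actual bucket ids will differ.

```python
import zlib

def build_lookup(vocabulary, num_oov_buckets):
    """Return a function mapping a word to an id, with hashed OOV buckets."""
    word_to_id = {word: i for i, word in enumerate(vocabulary)}
    vocab_size = len(vocabulary)

    def lookup(word):
        # Known words keep their fixed id in [0, vocab_size).
        if word in word_to_id:
            return word_to_id[word]
        # Unknown words are hashed into [vocab_size, vocab_size + num_oov_buckets).
        bucket = zlib.crc32(word.encode("utf-8")) % num_oov_buckets
        return vocab_size + bucket

    return lookup

lookup = build_lookup(["dog", "fish"], num_oov_buckets=1000)
known_ids = [lookup("dog"), lookup("fish")]
unknown_id = lookup("cat")  # not in the vocabulary, lands in an OOV bucket
```

Because every possible id now falls in `[0, vocab_size + num_oov_buckets)`, an embedding table with that many rows can never be indexed out of range. The trade-off is that two different unknown categories may share a bucket and therefore share an embedding.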

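    import tensorflow as tf

    # `vocabulary` is the list of known categories, e.g. ["dog", "cat", "fish"]
    words = tf.constant(vocabulary)
    word_ids = tf.range(len(vocabulary), dtype=tf.int64)

    # Important: reserve extra ids for categories that are not in `vocabulary`
    vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
    num_oov_buckets = 1000
    table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)  # lookup table: category -> id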

The training set can then be encoded (I am using the IMDb reviews dataset from TensorFlow Datasets):

    def encode_words(X_batch, y_batch):
        """
        Encode the training set, converting words to IDs
        using the lookup table just created.
        """
        return table.lookup(X_batch), y_batch

    # `datasets` and `preprocess` are defined elsewhere and are not shown here.
    train_set = datasets["train"].batch(32).map(preprocess)
    train_set = train_set.map(encode_words).prefetch(1)

When creating the model:

    from tensorflow import keras

    vocab_size = 10000    # should equal the length of `vocabulary`
    embedding_size = 128  # tunable hyperparameter
    model = keras.models.Sequential([
        # The input dimension must cover the vocabulary ids plus the OOV bucket ids
        keras.layers.Embedding(vocab_size + num_oov_buckets, embedding_size, input_shape=[None]),
        # usual code follows
    ])
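As a rough end-to-end check on the question's own example, the sketch below builds a lookup table from a two-word vocabulary and looks up a category that is not in it. The vocabulary, the bucket count of 5, and the embedding size of 4 are arbitrary choices for illustration, and the layer is untrained.

```python
import tensorflow as tf

vocabulary = ["dog", "fish"]  # categories seen in training (example values)
num_oov_buckets = 5           # small number chosen for illustration

init = tf.lookup.KeyValueTensorInitializer(
    tf.constant(vocabulary),
    tf.range(len(vocabulary), dtype=tf.int64),
)
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

# The embedding table has one row per vocabulary id plus one per OOV bucket.
embedding = tf.keras.layers.Embedding(len(vocabulary) + num_oov_buckets, 4)

# "cat" is out of vocabulary, so it receives an id in one of the OOV buckets.
ids = table.lookup(tf.constant(["dog", "cat", "fish"]))
vectors = embedding(ids)

ids_list = ids.numpy().tolist()
output_shape = tuple(vectors.shape)
```

Here 'dog' and 'fish' keep ids 0 and 1, 'cat' gets some id between 2 and 6, and all three lookups return a vector without raising the out-of-range error.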

And fit the data:

    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    history = model.fit(train_set, epochs=5)