Is TensorFlow's Visual Attention example suitable for the im2latex problem?

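I am currently trying to build my own solution to the im2latex problem. I have seen several projects on GitHub, and all of them use a Visual Attention mechanism to detect LaTeX symbols in an image. I have only worked with dense and convolutional neural networks before, so this topic is new to me.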

TensorFlow has a great tutorial on this topic (https://www.tensorflow.org/tutorials/text/image_captioning). I followed it, and the results were very good (example).

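The next step was to find a dataset of LaTeX equation images with captions; I used "im2latex-100k" (https://zenodo.org/record/56198#.YOeHcOgzaUl). I made a few changes to the provided captions so that they are easier to use with the Tokenizer, and I re-rendered some of the provided images so that they fit the equations more tightly (equation photos and their captions). The Tokenizer is almost the same as in the TF example: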

self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.top_k,split=' ',oov_token="<unk>",lower=False)
self.tokenizer.fit_on_texts(train_captions)
self.tokenizer.word_index['<pad>'] = 0  
self.tokenizer.index_word[0] = '<pad>'
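
For reference, here is a minimal pure-Python sketch of what this vocabulary setup is meant to do, assuming the captions are already split into space-separated LaTeX tokens. The names `build_vocab` and `encode` and the two sample captions are hypothetical and only illustrate the idea; this is not the Keras implementation. One point that may be worth checking in the real code: Keras's `Tokenizer` has a `filters` argument whose default strips most punctuation, and the call above does not set it, so tokens such as `\frac`, `{` or `^` might not survive as written.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"

def build_vocab(captions, top_k):
    """Return a token -> id dict: 0 is <pad>, 1 is <unk>, then tokens by frequency."""
    counts = Counter(tok for cap in captions for tok in cap.split(" ") if tok)
    vocab = {PAD: 0, UNK: 1}
    # Keep at most top_k entries in total, including the two special tokens.
    for tok, _ in counts.most_common(top_k - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab

def encode(caption, vocab):
    """Map each space-separated token to its id, falling back to <unk>."""
    return [vocab.get(tok, vocab[UNK]) for tok in caption.split(" ") if tok]

# Hypothetical captions, already tokenized with spaces.
captions = [r"\frac { a } { b }", r"a ^ { 2 }"]
vocab = build_vocab(captions, top_k=500)
ids = encode(r"\frac { a } { c }", vocab)  # "c" was never seen, so it maps to <unk>
```

Printing `vocab` for a handful of real captions and comparing it with `self.tokenizer.word_index` is a quick way to see whether the LaTeX symbols are being kept intact.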

The network components (CNN_encoder, RNN_decoder, BahdanauAttention) are the same as in the example.

I then trained several models, and even for the most powerful one the results were really bad. That model had the following parameters:

top_k = 500  # only 500 most common LaTeX words are used in tokenizer 
image_count = 50000  # 50000 photos were used to train the model
BATCH_SIZE = 64
BUFFER_SIZE = 100
embedding_dim = 128
units = 128
EPOCHS = 20
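
To put these settings in perspective, the short calculation below works out how many optimizer steps they imply and how large the shuffle buffer is relative to the dataset. It is only a sketch and assumes that all 50000 images are used for training, that the last partial batch is kept, and that `BUFFER_SIZE` is passed to `tf.data.Dataset.shuffle` as in the tutorial.

```python
import math

image_count = 50000
BATCH_SIZE = 64
BUFFER_SIZE = 100
EPOCHS = 20

# Assumes the final, smaller batch is not dropped.
steps_per_epoch = math.ceil(image_count / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

# Fraction of the dataset held in the shuffle buffer at any one time.
buffer_fraction = BUFFER_SIZE / image_count

print(steps_per_epoch, total_steps, buffer_fraction)  # 782 15640 0.002
```

Under these assumptions the buffer covers 0.2% of the data, so shuffling is only local: each batch is drawn from examples that were close together in the input order.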

After 20 epochs of training, the final loss was about 0.22.

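Not only does the model produce completely wrong captions, but the predicted captions are also completely different from one another.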
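
One thing that could contribute to predictions varying is how the next token is chosen at inference time. The sketch below contrasts greedy decoding (always take the highest-scoring token) with sampling from the softmax distribution. It assumes the evaluation loop samples each token, as some versions of the tutorial's evaluation function do with `tf.random.categorical`; whether that applies to the code used here is an assumption. The logits and the helper names are made up for illustration.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(logits):
    """Deterministic: index of the largest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

def pick_sampled(logits, rng):
    """Stochastic: draw an index with probability given by softmax(logits)."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]

logits = [2.0, 1.5, 0.5, 0.1]  # made-up scores over a 4-token vocabulary
rng = random.Random(0)

greedy = [pick_greedy(logits) for _ in range(5)]        # same token every time
sampled = [pick_sampled(logits, rng) for _ in range(5)]  # may vary between draws
print(greedy, sampled)
```

With greedy decoding the same image always yields the same caption, which makes it easier to tell whether the model itself is wrong or whether the randomness comes from the decoding step.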

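So the question is: why does this happen? Is it a lack of training images, or is the model unsuitable for the im2latex problem? Thanks in advance!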