
How to use TF CTC loss with variable-length features and labels

I want to implement a speech recognizer with CTC loss in Tensorflow. The input features have variable length because each speech utterance can have a different duration. The labels also have variable length because each transcription is different. I manually pad the features to create batches, and in my model I have a tf.keras.layers.Masking() layer to create the mask and propagate it through the network. I also create the batch of labels with padding.

Here is a dummy example. Suppose I have two utterances of 3 and 5 frames respectively. Each frame is represented by a single feature (it would normally be 13 MFCCs, but I reduce it to one to keep it simple). So to create the batch, I pad the short utterance with zeros at the end:

features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.7, 2.6, 3.4, 2.3, 1.0]])

The labels are the transcriptions of these utterances. Say their lengths are 2 and 3 respectively. The label batch would then have shape [2, 3, 26], where the batch size is 2, the maximum length is 3, and there are 26 English characters (one-hot encoded).
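
For concreteness, a hypothetical sketch of how such a padded one-hot label batch could be built (the letter indices here are made up for illustration):

import numpy as np

# Hypothetical transcriptions of lengths 2 and 3, as 0-based indices into the alphabet
label_ids = [[7, 8], [2, 0, 19]]  # e.g. "hi" and "cat"
batch_size, max_len, num_chars = 2, 3, 26

labels = np.zeros((batch_size, max_len, num_chars), dtype=np.float32)
for i, ids in enumerate(label_ids):
    labels[i, np.arange(len(ids)), ids] = 1.0  # one-hot encode; padded steps stay all-zero
print(labels.shape)  # (2, 3, 26)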

The model is:

input_ = tf.keras.Input(shape=(None, 1))
x = tf.keras.layers.Masking()(input_)
x = tf.keras.layers.GRU(26, return_sequences=True)(x)  # feed x, not input_, so the mask propagates
output_ = tf.keras.layers.Softmax(axis=-1)(x)
model = tf.keras.Model(input_, output_)

The loss function looks something like:

def ctc_loss(y_true, y_pred):
    # Do something here to get logit_length and label_length?
    # ...
    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length, label_length)
    return loss

My question is how to get logit_length and label_length. I would expect logit_length to be encoded in the mask, but if I do y_pred._keras_mask, the result is None. For label_length, the information is in the tensor itself, but I am not sure of the most efficient way to get it.

Thanks.

Update:

Following your answer, I am using tf.math.count_nonzero to get label_length, and I am setting logit_length to the full length of the logit layer, as sketched after the shape list below.

So the shapes inside the loss function are (batch size = 10):

y_true.shape = (10,None)
y_pred.shape = (10,None,27)
label_length.shape = (10,1)
logit_length.shape = (10,1)
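
A minimal sketch of that updated loss, assuming labels are zero-padded (so 0 cannot be a real class) and taking the full padded time dimension as logit_length:

def ctc_loss(y_true, y_pred):
    # label_length: number of non-padding (non-zero) tokens per transcription
    label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
    # logit_length: the full (padded) time dimension, the same for every sample
    logit_length = tf.ones((tf.shape(y_pred)[0], 1), dtype=tf.int32) * tf.shape(y_pred)[1]
    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length, label_length)
    return tf.reduce_mean(loss)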

Of course the "None" dimensions of y_true and y_pred are not the same, since one is the maximum string length of the batch and the other is the maximum number of time frames of the batch. However, when I call model.fit() with these arguments and the loss tf.keras.backend.ctc_batch_cost(), I get the error:

Traceback (most recent call last):
  File "train.py", line 164, in <module>
    model.fit(dataset, batch_size=batch_size, epochs=10)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
    tmp_logs = train_function(iterator)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1661, in _filtered_call
    return self._call_flat(
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
    outputs = execute.execute(
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Incompatible shapes: [10,92] vs. [10,876]
         [[node Equal (defined at train.py:164) ]]
  (1) Invalid argument:  Incompatible shapes: [10,876]
         [[node Equal (defined at train.py:164) ]]
         [[ctc_loss/Log/_62]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_3156]

Function call stack:
train_function -> train_function

It seems to be complaining that the length of y_true (92) is not the same as the length of y_pred (876), which I thought it should not have to be. What am I missing?

Solution

At least for recent versions of Tensorflow (2.2 and above), the Softmax layer supports masking, and the output for masked steps is not zero; instead, the last unmasked value is simply repeated.

import numpy as np
import tensorflow as tf

# Two utterances of lengths 3 and 5, zero-padded and reshaped to (batch, time, 1)
features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.7, 2.6, 3.4, 2.3, 1.0]]).reshape(2, 5, 1)

input_ = tf.keras.Input(shape=(None,1))
x = tf.keras.layers.Masking()(input_)

x = tf.keras.layers.GRU(2,return_sequences=True)(x)

output_ = tf.keras.layers.Softmax(axis=-1)(x)

model = tf.keras.Model(input_,output_)

r = model(features)
print(r)
 

The output for the first sample contains repeated values that correspond to the masked steps:

<tf.Tensor: shape=(2, 5, 2), dtype=float32, numpy=
array([[[0.53308547, 0.46691453],
        [0.5477166 , 0.45228338],
        [0.55216545, 0.44783455],
        [0.55216545, 0.44783455],
        [0.55216545, 0.44783455]],

       [[0.532052  , 0.46794805],
        [0.54557794, 0.454422  ],
        [0.55263203, 0.44736794],
        [0.56076777, 0.4392322 ],
        [0.5722393 , 0.42776066]]], dtype=float32)>

To get the number of non-masked steps of each sequence (I am using tf.version == 2.2), this works for me:

get_mask = r._keras_mask

You can extract the sequence lengths from the values of the get_mask tensor:

<tf.Tensor: shape=(2, 5), dtype=bool, numpy=
array([[ True,  True,  True, False, False],
       [ True,  True,  True,  True,  True]])>
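
Reducing that boolean mask to per-sequence lengths is a one-liner; a minimal sketch:

# Count the True entries per row: the number of unmasked time steps per sequence
seq_length = tf.reduce_sum(tf.cast(get_mask, tf.int32), axis=-1, keepdims=True)
print(seq_length.numpy())  # [[3]
                           #  [5]]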

Alternatively, you can get label_length by counting the values different from zero in the tensor y_true:

label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
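
For example, with labels of true lengths 3 and 2, zero-padded to the batch maximum of 5 (this assumes 0 is reserved for padding and is not a real label class):

y_true = np.array([[1., 2., 3., 0., 0.],
                   [1., 1., 0., 0., 0.]])
label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
print(label_length.numpy())  # [[3]
                             #  [2]]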

For the value of logit_length, all the implementations I have seen simply use the full time_step length, so logit_length can be:

logit_length = tf.ones(shape=(your_batch_size, 1)) * time_step

Or you can use the mask tensor to get only the unmasked time_steps:

logit_length = tf.reshape(
    tf.reduce_sum(tf.cast(y_pred._keras_mask, tf.float32), axis=1),
    (your_batch_size, -1))

Here is a complete example:

features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.5, 2.0, 1.0, 0.0, 0.0]]).reshape(2, 5, 1)
labels = np.array([[1., 2., 3., 0., 0.],
                   [1., 1., 0., 0., 0.]]).reshape(2, 5)

input_ = tf.keras.Input(shape=(5,1))
x = tf.keras.layers.Masking()(input_)
x = tf.keras.layers.GRU(5, return_sequences=True)(x)  # 5 = number of classes + blank (in your case 26 + 1)
output_ = tf.keras.layers.Softmax(axis=-1)(x)

model = tf.keras.Model(input_,output_)


def ctc_loss(y_true, y_pred):
  label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
  logit_length = tf.reshape(
      tf.reduce_sum(tf.cast(y_pred._keras_mask, tf.float32), axis=1),
      (2, -1))  # 2 = batch size

  loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length, label_length)
  return tf.reduce_mean(loss)

model.compile(loss=ctc_loss, optimizer='adam')
model.fit(features, labels, epochs=10)
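
One caveat: the batch size (2) is hardcoded in ctc_loss above. A variant of the same loss that derives it at runtime instead, still assuming (as stated above for tf 2.2) that y_pred._keras_mask is available inside the loss:

def ctc_loss(y_true, y_pred):
    # Derive the batch size from the tensor itself so any batch size works
    batch_size = tf.shape(y_pred)[0]
    label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
    logit_length = tf.reshape(
        tf.reduce_sum(tf.cast(y_pred._keras_mask, tf.float32), axis=1),
        (batch_size, 1))
    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length, label_length)
    return tf.reduce_mean(loss)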
