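Back-propagation implementation: how to apply the chain rule to matrices for dL/dX and dL/dW

Gradient `dL/dX` using the chain rule

Suppose L is the loss of the neural network, X is the input, and Y is the output of the dot product Y = X•W = np.dot(X,W).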

According to the chain rule, dL/dX → dY/dX • dL/dY → W.T • dL/dY, because dY/dX = W.T for the product Y = X•W.

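Question 1

How do I apply the chain-rule formula dL/dX → W.T • dL/dY to matrices? Applying it naively, as below, does not work because the shapes do not match: W.T is (4,3) and dL/dY is (4,).

What idea, principle, or transformation can I apply to get past this? I suspect matrices call for a different way of thinking.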

        # gradient dy (dL/dY) back-propagated from the posterior layer
        dy = self.posterior.backward()    

        # Apply chain-rule dL/dX = dY/dX @ dL/dY where dY/dX = W.T
        dx = np.dot(self.w.T,dy)
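
The mismatch can be reproduced in isolation. The following is a minimal sketch, assuming a weight matrix of shape (3,4) and a gradient `dy` of shape (4,) as described above; the concrete values are made up for illustration.

```python
import numpy as np

W = np.arange(12, dtype=float).reshape(3, 4)   # weights: 3 features, 4 units
dy = np.ones(4)                                # stand-in for dL/dY, shape (4,)

try:
    # (4,3) @ (4,): the inner dimensions 3 and 4 differ, so NumPy refuses.
    np.dot(W.T, dy)
    mismatch = False
except ValueError as err:
    mismatch = True
    print("np.dot(W.T, dy) failed:", err)
```

The result of dL/dX has to have the same shape as the input, here (3,), so whatever product is used must contract over the 4 units and leave the 3 features.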

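Note: there are typos in the figure. (,4) should read (4,), and so on. In my head, a one-dimensional array of 4 elements is (,4), but in NumPy it is (4,).

Question 2

What is the rationale for having to transpose W and X (W.T and X.T) for the chain rule to work? If I transpose W, I think I can use dL/dY without transposing it, but please help me understand.

For `dL/dX`

I saw an answer that swaps the positions, but I do not know where that comes from or why. Why is it permissible to change the order of the factors in the chain rule?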

        # dL/dX = dL/dY • W.T instead of W.T • dL/dY 
        dx = np.dot(dy,self.w.T)   # dy(4,) @ w.T(4,3) -> (3,)
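
One way to see where the swapped order comes from is to write the chain rule per element instead of per matrix. Since Y[j] = sum over i of x[i] * W[i,j], the scalar chain rule gives dL/dx[i] = sum over j of dL/dY[j] * W[i,j], and that sum is exactly what `np.dot(dy, W.T)` computes. The sketch below checks this numerically; the toy loss and the random values are assumptions chosen only so that dL/dY is known.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # one input with 3 features, shape (3,)
W = rng.normal(size=(3, 4))     # weights, shape (3,4)
dy = rng.normal(size=4)         # stand-in for dL/dY, shape (4,)

# Element-wise chain rule: dL/dx[i] = sum_j dL/dY[j] * W[i, j]
dx_loop = np.zeros(3)
for i in range(3):
    for j in range(4):
        dx_loop[i] += dy[j] * W[i, j]

# The same sum written as a matrix product: dy(4,) @ W.T(4,3) -> (3,)
dx_matrix = np.dot(dy, W.T)

# Toy loss L = sum(dy * (x @ W)), chosen so that dL/dY equals dy.
def loss(x_):
    return np.sum(dy * np.dot(x_, W))

# Central-difference approximation of dL/dx as an independent check.
eps = 1e-6
dx_numeric = np.array([
    (loss(x + eps * np.eye(3)[i]) - loss(x - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
```

Seen this way, the order of the factors is not changed arbitrarily: the scalar products inside the sum commute, and the matrix form is simply whichever arrangement reproduces that sum with matching shapes.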

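For `dL/dW`

In the answer shown in the figure below, the shapes of X.T (,3) and dL/dY (,4) are converted to (3,1) and (1,4) so that they match (in that answer they are actually (2,1) and (1,3), but I have kept them consistent with the snapshot above). I am not sure where this comes from or what the rationale behind it is.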
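
The (3,1) and (1,4) reshaping can be read as an outer product. Per element, Y[j] = sum over i of x[i] * W[i,j], so dL/dW[i,j] = x[i] * dL/dY[j], which is a (3,4) array with one entry per weight. The following sketch, with made-up values and a toy loss chosen so that dL/dY is known, compares the reshaped product against `np.outer` and a numerical gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)          # one input, shape (3,)
W = rng.normal(size=(3, 4))     # weights, shape (3,4)
dy = rng.normal(size=4)         # stand-in for dL/dY, shape (4,)

# Column vector (3,1) times row vector (1,4) gives dL/dW with W's shape (3,4).
dW_reshaped = np.dot(x.reshape(3, 1), dy.reshape(1, 4))
dW_outer = np.outer(x, dy)      # same thing: dW[i, j] = x[i] * dy[j]

# Toy loss L = sum(dy * (x @ W)), chosen so that dL/dY equals dy.
def loss(W_):
    return np.sum(dy * np.dot(x, W_))

# Central-difference approximation of dL/dW, one weight at a time.
eps = 1e-6
dW_numeric = np.zeros_like(W)
for i in range(3):
    for j in range(4):
        step = np.zeros_like(W)
        step[i, j] = eps
        dW_numeric[i, j] = (loss(W + step) - loss(W - step)) / (2 * eps)
```

For a batch X of shape (m,3), `np.dot(X.T, dy)` with dy of shape (m,4) adds up these per-sample outer products over the m rows.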

Answer

deep-learning-from-scratch/common/layers.py

    def backward(self,dout):
        dx = np.dot(dout,self.W.T)
        self.dW = np.dot(self.x.T,dout)
        self.db = np.sum(dout,axis=0)
        
        dx = dx.reshape(*self.original_x_shape)  # restore the shape of the input data (tensor support)
        return dx
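
The three formulas in this `backward` can be checked against numerical gradients on a small batch. This is a sketch under assumptions that are not shown in the excerpt: the forward pass is taken to be Y = np.dot(X,W) + b, and the loss is a toy L = 0.5 * sum(Y**2), for which dL/dY is simply Y. The helper `numerical_grad` is a hypothetical name written for this example.

```python
import numpy as np

def numerical_grad(f, a, eps=1e-6):
    """Central-difference gradient of the scalar f() with respect to array a.

    f reads a through a closure, so a is perturbed in place and restored.
    """
    grad = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        old = a[idx]
        a[idx] = old + eps
        plus = f()
        a[idx] = old - eps
        minus = f()
        a[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
X = rng.normal(size=(2, 3))     # batch of 2 inputs with 3 features
W = rng.normal(size=(3, 4))     # weights for 4 units
b = rng.normal(size=4)          # bias, assumed to be added in the forward pass

def loss():
    Y = np.dot(X, W) + b
    return 0.5 * np.sum(Y ** 2)

dout = np.dot(X, W) + b         # dL/dY for this toy loss, shape (2,4)
dx = np.dot(dout, W.T)          # (2,4) @ (4,3) -> (2,3), same shape as X
dW = np.dot(X.T, dout)          # (3,2) @ (2,4) -> (3,4), same shape as W
db = np.sum(dout, axis=0)       # sum over the batch -> (4,), same shape as b
```

In each case the gradient has the shape of the variable it belongs to, which is a useful sanity check when deciding which operand to transpose.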

Code

Work in progress: untested and not yet working.

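from typing import List

import numpy          # used by the numpy.ndarray annotations below
import numpy as np    # used by the np.* calls below


class Affine(object):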
    """Affine (MatMul) Layer"""
    def __init__(self,units,weights,optimizer,posteriors: List[object]):
        """Initialize the affine layer.
        
        [X] shape(size,n)
        Aka Batch. An array of input data x with n features (n: 0,1,...,n). n=0 is a bias.
        j-th input X[j] is [x(j)(0),x(j)(1),... x(j)(n)] where bias 'x(j)(0)' is 1.
        Use capital X for batch and x for its individual input.
        
        NOTE: "input" is not limited to the first input data layer e.g. image pixels,but "input" at any layer.

        [weights] shape(n,units)
        k-th neuron (k:0,.. units-1) has its weight vector W(k):[w(k)(0),w(k)(1),... w(k)(n)].
        w(k)(0) is its bias weight. Each w(k)(i) amplifies i-th feature in the input x.

        Args:
            units: number of neurons in the layer
            weights: array of weight-vectors of each neuron. shape(n,units)
            optimizer: gradient descent implementation e.g SGD,Adam.
            posteriors: next layers
        """
        # neuron weight vectors
        self.w: numpy.ndarray = weights  # weight vector per neuron
        self.n: int = weights.shape[0]   # number of features expected
        self.dw: numpy.ndarray = None    # gradient of W
        
        self.X: numpy.ndarray = np.empty((0,self.n))   # Batch input; shape must be passed as a tuple
        self.m: int  = -1                # batch size: X.shape[0]

        self.optimizer = optimizer       # kept so that backward() can refer to it
        self.posterior = posteriors[0]
        
        
    def forward(self,X):
        """Forward propagation of the affine layer X@W"""
        # X@W from X(m,n) @ W(n,units) to generate output Y(m,units)
        self.X = X                       # keep the batch for the backward pass
        self.m = X.shape[0]              # batch size
        Y = np.dot(self.X,self.w)
        self.posterior.forward(Y)


    def backward(self):
        # --------------------------------------------------------------------------------
        # Back propagation dy from the posterior layer. dy shape must match that of Y(m,units)
        # --------------------------------------------------------------------------------
        dy = self.posterior.backward()    # gradient back-propagated from the posterior 
        assert dy.shape == (self.m,self.w.shape[1]),\
        "gradient dy shape {} must match output Y shape ({},{})".format(
            dy.shape,self.m,self.w.shape[1]
        )

        # --------------------------------------------------------------------------------
        # Gradient descent on W
        # --------------------------------------------------------------------------------
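        dw = np.dot(self.X.T,dy)     # X.T(n,m) @ dy(m,units) -> dw(n,units)
        self.dw = dw

        # dL/dX uses W as it was during the forward pass, so compute it before any update of W.
        dx = np.dot(dy,self.w.T)     # dy(m,units) @ W.T(units,n) -> dx(m,n)

        # TODO: apply the optimizer step to self.w here. The original call,
        # `self.optimizer.(self.w,dw)`, was left without a method name.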
        return dx
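
To see the gradient-descent step on W in action without the rest of the class, the following is a minimal standalone sketch. Everything beyond `np.dot(X.T, dy)` and `np.dot(dy, W.T)` is an assumption made for illustration: the targets `T`, the mean-squared-error loss, the learning rate `lr`, and the plain update `W - lr * dw` standing in for whatever optimizer is eventually used.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 3))     # batch of m=5 inputs with n=3 features
T = rng.normal(size=(5, 4))     # hypothetical targets for 4 units
W = rng.normal(size=(3, 4))     # weights, shape (n, units)
lr = 0.05                       # hypothetical learning rate

def mse(W_):
    """Loss L = 1/(2m) * sum((X @ W - T)**2)."""
    return 0.5 * np.mean(np.sum((np.dot(X, W_) - T) ** 2, axis=1))

initial_loss = mse(W)
for _ in range(200):
    Y = np.dot(X, W)                # forward: (5,3) @ (3,4) -> (5,4)
    dy = (Y - T) / X.shape[0]       # dL/dY for the loss above, shape (5,4)
    dx = np.dot(dy, W.T)            # dL/dX, computed with W before it is updated
    dw = np.dot(X.T, dy)            # dL/dW, shape (3,4)
    W = W - lr * dw                 # plain gradient-descent step
final_loss = mse(W)

print("loss before:", initial_loss, "loss after:", final_loss)
```

Computing `dx` before updating `W` matters when the layer sits in a stack, because the gradient passed to the previous layer should correspond to the weights that produced Y in the forward pass.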