What is the gradient of a matrix product $AB$?

176 views. Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail) on 2026-04-11.

The sixth page of http://cs231n.stanford.edu/vecDerivs.pdf says that $\partial Y_{i,:}/\partial X_{i,:} = W$, but page 212 of https://www.deeplearningbook.org/contents/mlp.html says the gradient is $GB^T$, so I am sure there is a gap in my understanding somewhere. I understand how we get the first derivative, but then what is $GB^T$? Is that also a derivative? (Sorry if I'm using the words "gradient" and "derivative" interchangeably; I'm still trying to get a grasp of this subject.)

1 answer:
$ \def\d{\delta}\def\o{{\tt1}}\def\p{\partial} \def\L{\left}\def\R{\right}\def\LR#1{\L(#1\R)} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} $Renaming the variables of the first reference from $(Y,X,W)\to(C,A,B)$ makes it comparable to the second reference; the basic relationship is $$C=AB$$ The second reference introduces a scalar cost function $z$, which is assumed to be a function of $C$, and whose gradient wrt $C$ is the matrix $$G=\grad{z}{C}$$ It then proposes using the chain rule to calculate the gradient wrt $A$.
However, it is easier to write the differential, then change the independent variable from $C\to A$ $$\eqalign{ dz &= G:dC \\ &= G:\LR{dA\;B} \\ &= GB^T:dA \\ \grad{z}{A} &= GB^T \\ }$$ where $(:)$ denotes the matrix inner product, i.e. $$\eqalign{ X:Y &= \sum_{i=1}^m\sum_{j=1}^n X_{ij}Y_{ij} \;=\; \trace{X^TY} \\ X:X &= \big\|X\big\|^2_F \\ }$$ The properties of the underlying trace function allow the terms in such a product to be rearranged in many different but equivalent ways, e.g. $$\eqalign{ X:Y &= Y:X \\ X:Y &= X^T:Y^T \\ W:XY &= WY^T:X = X^TW:Y \\\\ }$$
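The identity $\grad{z}{A}=GB^T$ is easy to verify numerically. The sketch below (NumPy; the particular cost $z(C)=\sum_{ij}\sin C_{ij}$ is an arbitrary smooth choice, not taken from either reference) compares $GB^T$ against a finite-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# Arbitrary smooth scalar cost z(C); any differentiable choice works.
z = lambda C: np.sum(np.sin(C))
G = np.cos(A @ B)        # G = dz/dC for this particular cost

analytic = G @ B.T       # the claimed gradient dz/dA = G B^T

# Central-difference gradient wrt each entry of A
eps = 1e-6
numeric = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        Ap = A.copy(); Ap[i, j] += eps
        Am = A.copy(); Am[i, j] -= eps
        numeric[i, j] = (z(Ap @ B) - z(Am @ B)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

The same check with $z$ swapped for any other smooth cost will still pass, since the derivation never used the specific form of $z$, only $G=\grad{z}{C}$.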
Now back to the first reference, which describes how to calculate something much more complicated: the matrix-by-matrix gradient $\,\grad{C}{A}$
Once again, this can be calculated most easily using differentials $$\eqalign{ C &= AB \\ C_{ij} &= \sum_{p=\o}^D A_{ip}\,B_{pj} \\ dC_{ij} &= \sum_{p=\o}^D dA_{ip}\,B_{pj} \\ \grad{C_{ij}}{A_{\ell k}} &= \sum_{p=\o}^D \grad{A_{ip}}{A_{\ell k}}\;B_{pj} \\ &= \sum_{p=\o}^D \d_{i\ell}\,\d_{pk}\,B_{pj} \\ &= \d_{i\ell}\,B_{kj} \\ }$$ The PDF then sets $\ell=i$ to evaluate the remaining Kronecker delta symbol as $\o$; however, leaving the delta symbol intact yields a more general (and more useful) result.
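This fourth-order result can also be checked numerically. The sketch below (NumPy; the shapes are arbitrary) builds the tensor $\grad{C_{ij}}{A_{\ell k}}=\d_{i\ell}B_{kj}$ with `einsum` and compares it against finite differences of $C=AB$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, D, n = 2, 3, 4
A = rng.standard_normal((m, D))
B = rng.standard_normal((D, n))

# Analytic 4th-order gradient: dC_ij/dA_lk = delta_il * B_kj
grad = np.einsum('il,kj->ijlk', np.eye(m), B)   # shape (m, n, m, D)

# Central-difference check: perturb one entry A_lk at a time
eps = 1e-6
fd = np.zeros((m, n, m, D))
for l in range(m):
    for k in range(D):
        Ap = A.copy(); Ap[l, k] += eps
        Am = A.copy(); Am[l, k] -= eps
        fd[:, :, l, k] = (Ap @ B - Am @ B) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-6))  # True
```

Note that perturbing $A_{\ell k}$ only changes row $\ell$ of $C$, which is exactly what the $\d_{i\ell}$ factor in the derivation says.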
As you read more, you will discover that the field of Machine Learning uses a hodge-podge of mathematical notations. Every book or article uses a different approach, and most of them are terrible.