I'm implementing a normalization function in C# that applies mean and variance normalization to an input matrix. The mean and variance are computed over the entire matrix, not row by row. I'm calculating the gradients of this function using the chain rule, but the gradients I compute don't match the approximations obtained with the finite-differences method. I'd appreciate help understanding why they don't match and whether there's an issue in my calculations.
The normalization function applies mean and variance normalization in two steps:
Mean normalization: $y_1 = x - \mu$

Variance normalization: $y_2 = \frac{y_1}{\sqrt{\sigma^2 + \epsilon}}$

Here's the gradient calculation for each step:
Mean normalization:
Gradient of $y_1$ with respect to $x$: $\frac{dy_1}{dx} = 1$
Gradient of $y_1$ with respect to $\mu$: $\frac{dy_1}{d\mu} = -1$
Gradient of $\mu$ with respect to $x$: $\frac{d\mu}{dx} = \frac{1}{N}$, where $N$ is the number of elements in $x$.
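As a quick sanity check on this step, here's a small Python sketch (mine, not part of the C# code; the test values of `x` and the step `h` are arbitrary) confirming $\frac{d\mu}{dx_i} = \frac{1}{N}$ by finite differences:

```python
# Hypothetical sanity check: finite-difference approximation of d(mean)/dx_i
# for the mean of a flat vector; it should come out to 1/N for every i.

def mean(xs):
    return sum(xs) / len(xs)

x = [0.5, -1.2, 3.0, 0.7]  # arbitrary test values
N = len(x)
h = 1e-6

for i in range(N):
    bumped = list(x)
    bumped[i] += h
    numeric = (mean(bumped) - mean(x)) / h
    assert abs(numeric - 1.0 / N) < 1e-5
print("d(mean)/dx_i == 1/N for every i")
```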
Variance normalization:
Gradient of $y_2$ with respect to $y_1$: $\frac{dy_2}{dy_1} = \frac{1}{\sqrt{\sigma^2 + \epsilon}}$
Gradient of $\sigma^2$ with respect to $x$: $\frac{d\sigma^2}{dx} = \frac{2(x - \mu)}{N}$
Gradient of $y_2$ with respect to $\sigma^2$: $\frac{dy_2}{d\sigma^2} = -\frac{1}{2}\left(\frac{y_1}{(\sigma^2 + \epsilon)^{\frac{3}{2}}}\right)$
Gradient of $y_2$ with respect to $\mu$: $\frac{dy_2}{d\mu} = -\frac{1}{N \cdot (\sigma^2 + \epsilon)^{\frac{3}{2}}}$
Gradient of $\sigma^2$ with respect to $\mu$: $\frac{d\sigma^2}{d\mu} = -\frac{2}{N}\sum(x - \mu)$
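The $\frac{d\sigma^2}{dx}$ formula can be checked numerically the same way. A small Python sketch (again mine, with arbitrary test values), bumping one element at a time:

```python
# Hypothetical check of d(variance)/dx_i = 2*(x_i - mu)/N. Bumping x_i also
# shifts the mean, but that effect cancels because sum(x - mu) == 0, so the
# simple formula still matches the finite-difference value.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / len(xs)

x = [0.5, -1.2, 3.0, 0.7]  # arbitrary test values
N = len(x)
mu = sum(x) / N
h = 1e-6

for i in range(N):
    bumped = list(x)
    bumped[i] += h
    numeric = (variance(bumped) - variance(x)) / h
    assert abs(numeric - 2.0 * (x[i] - mu) / N) < 1e-4
print("d(variance)/dx_i matches 2*(x_i - mu)/N")
```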
Combining the gradients:
Using the chain rule, I computed the gradients of the output with respect to $x$ and $\mu$:
Gradient of the output with respect to $x$: $\frac{d\text{Output}}{dx} = \left(\frac{dy_2}{dy_1}\right)\left(\frac{dy_1}{dx}\right) + \left(\frac{dy_2}{d\sigma^2}\right)\left(\frac{d\sigma^2}{dx}\right)$
Gradient of the output with respect to $\mu$: $\frac{d\text{Output}}{d\mu} = \left(\frac{dy_2}{dy_1}\right)\left(\frac{dy_1}{d\mu}\right) + \left(\frac{dy_2}{d\sigma^2}\right)\left(\frac{d\sigma^2}{d\mu}\right) + \frac{dy_2}{d\mu}$
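For comparison, the finite-differences side of the check approximates the full Jacobian of $y_2$ with respect to $x$. Here is a Python sketch of that reference quantity (`normalize` is my flat-vector port of the forward pass, not the original C#):

```python
# Numerical Jacobian J[i][j] ~= d(y2_i)/d(x_j) of the full normalization,
# via central differences. `normalize` is a flat-vector sketch of the
# forward pass; eps plays the same role as epsilon in the formulas above.

def normalize(xs, eps=1e-5):
    n = len(xs)
    mu = sum(xs) / n
    var = sum((v - mu) ** 2 for v in xs) / n
    s = (var + eps) ** 0.5
    return [(v - mu) / s for v in xs]

def numerical_jacobian(xs, h=1e-6):
    n = len(xs)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus, minus = list(xs), list(xs)
        plus[j] += h
        minus[j] -= h
        yp, ym = normalize(plus), normalize(minus)
        for i in range(n):
            jac[i][j] = (yp[i] - ym[i]) / (2 * h)
    return jac

x = [0.5, -1.2, 3.0, 0.7]  # arbitrary test values
J = numerical_jacobian(x)
# Off-diagonal entries of J are nonzero: perturbing one input moves mu and
# sigma^2, and through them every output element.
```

Because $\mu$ and $\sigma^2$ depend on every element of $x$, this Jacobian is dense, so any combined analytic gradient has to account for cross-element contributions as well as the per-element terms.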
The issue is that the computed gradients don't match the finite-difference approximations: the differences are far outside an acceptable range (the max difference is around 3.18), and I'm not sure why.
Could you please help me understand if there's an issue in my gradient calculations or if there's a better way to compute these gradients?
Here's the NormalizationForward function for reference:
```csharp
public static double[,] NormalizationForward(double[,] input, int outputHeight, int outputWidth, double epsilon)
{
    double[,] output = (double[,])input.Clone();
    double mean = 0;
    double variance = 0;
    double N = outputHeight * outputWidth;

    // Calculate mean over the entire matrix
    for (int i = 0; i < outputHeight; i++)
    {
        for (int j = 0; j < outputWidth; j++)
        {
            mean += input[i, j];
        }
    }
    mean /= N;

    // Calculate variance over the entire matrix
    for (int i = 0; i < outputHeight; i++)
    {
        for (int j = 0; j < outputWidth; j++)
        {
            double diff = input[i, j] - mean;
            variance += diff * diff;
        }
    }
    variance /= N;

    // Normalize activations
    for (int i = 0; i < outputHeight; i++)
    {
        for (int j = 0; j < outputWidth; j++)
        {
            output[i, j] = (input[i, j] - mean) / Math.Sqrt(variance + epsilon);
        }
    }
    return output;
}
```
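For debugging, here is a minimal Python port of `NormalizationForward` together with a central-difference reference gradient (the upstream gradient `g`, `num_grad`, and the seed are illustrative choices of mine, not from the original code):

```python
import random

# Minimal Python port of NormalizationForward (list-of-rows instead of
# double[,]) plus a central-difference reference gradient. The upstream
# gradient g, num_grad, and the seed are illustrative choices.

def normalization_forward(x, eps=1e-5):
    flat = [v for row in x for v in row]
    n = len(flat)
    mu = sum(flat) / n
    var = sum((v - mu) ** 2 for v in flat) / n
    s = (var + eps) ** 0.5
    return [[(v - mu) / s for v in row] for row in x]

def loss(x, g, eps=1e-5):
    # Scalar proxy: sum of g * y2, so dL/dx is the gradient to check.
    y = normalization_forward(x, eps)
    return sum(gv * yv for grow, yrow in zip(g, y)
               for gv, yv in zip(grow, yrow))

random.seed(0)
rows, cols = 2, 3
x = [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
g = [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

h = 1e-6
num_grad = [[0.0] * cols for _ in range(rows)]
for i in range(rows):
    for j in range(cols):
        plus = [row[:] for row in x]
        minus = [row[:] for row in x]
        plus[i][j] += h
        minus[i][j] -= h
        num_grad[i][j] = (loss(plus, g) - loss(minus, g)) / (2 * h)
```

An analytic backward pass should reproduce `num_grad` to within roughly `1e-6` here; a max difference around 3.18 is far beyond finite-difference noise, which points at the analytic formula rather than the choice of `h`.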