I would like to understand why in a single-layer network, given that all neurons use the same training examples (all of them), the gradient descent method does not reach the same solution for the weights of each neuron. In other words, if the equations are identical for the activation function of each neuron, why does the solution converge to different weights for each neuron?
Why doesn't optimization result in the same weights for all neurons in the same layer?

87 views. Asked on 2026-03-26 by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail).

There are 2 answers below.
If you started with all weights set to zero, say, then yes, they should all update exactly the same way and all the weights on a given layer would be the same. In that case, the expressive power of the neural network would be very limited. To break this symmetry, the weights of the network are always initialized randomly before training begins.
Previous answers/comments have pointed out that this is precisely the reason why random initialization is done in practice. To elaborate a bit further:
Suppose we have $n$ samples with inputs $\mathbf{X} \in \mathbb{R}^{n \times k}$, and a single hidden layer with $d$ neurons, with outputs $\mathbf{Y} \in \mathbb{R}^{n \times 1}$.
Then the neural network $f$ performs the mapping $$ f(\mathbf{X}) = \sigma(\sigma(\mathbf{XW + b})\mathbf{W}_2 + \mathbf{b}_2) $$ where $\mathbf{W} \in \mathbb{R}^{k \times d}$ and $\mathbf{W}_2 \in \mathbb{R}^{d \times 1}$ are the weights, $\mathbf{b}, \mathbf{b}_2$ are the biases, and $\sigma(\cdot)$ is the activation function.
For simplicity, let's suppose we have identity activation functions $\sigma(t) = t$, no bias terms $\mathbf{b} \equiv 0 \equiv \mathbf{b}_2$, and we are training via mean-squared-error loss.
In the backpropagation step, the gradient of the single-example loss w.r.t. $\mathbf{W}_2$ is $$ \nabla_{\mathbf{W}_2} \, (y_i - x_i'\mathbf{W}\mathbf{W}_2)^2 = 2(x_i'\mathbf{W})'(x_i'\mathbf{W}\mathbf{W}_2 - y_i), $$ and the gradient w.r.t. $\mathbf{W}$ follows analogously by the chain rule. If $\mathbf{W}$ and $\mathbf{W}_2$ are initialized so that every neuron has identical weights (not necessarily 0), the gradient w.r.t. each neuron's weights is the same. Every gradient step then preserves this symmetry, and we end up with the same weights for each neuron.
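This symmetry can be checked numerically. The following NumPy sketch (sizes, initial values, and target are made up for illustration) computes the single-example gradients above under identical initialization and verifies that every neuron receives the same update:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 3, 4
x = rng.normal(size=(k, 1))    # a single input x_i
y_i = 0.5                      # a single target (arbitrary)

# Identical initialization: every column of W (one column per hidden neuron)
# and every entry of W2 are equal. The constants 0.2 and 0.3 are arbitrary.
W = np.full((k, d), 0.2)
W2 = np.full((d, 1), 0.3)

# Gradients of (x_i' W W2 - y_i)^2 with identity activations and no biases.
resid = float(x.T @ W @ W2) - y_i
grad_W2 = 2 * resid * (W.T @ x)    # d x 1
grad_W = 2 * resid * (x @ W2.T)    # k x d, via the chain rule

# Every hidden neuron receives the same gradient, so a gradient step
# cannot break the symmetry.
print(np.allclose(grad_W[:, 0:1], grad_W))   # True
print(np.allclose(grad_W2, grad_W2[0, 0]))   # True
```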
However, if the weights are randomly initialized to different values, both the starting point and every subsequent gradient step differ across neurons, and we end up with different weights for each neuron.
Here is some simple code to play around with to build some intuition.
Identical initializations of hidden layer:
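The original snippet is not reproduced here; below is a minimal NumPy sketch of such an experiment (identity activations, no biases, MSE loss; the sizes, learning rate, and step count are illustrative). With every hidden neuron initialized identically, the columns of $\mathbf{W}$ remain identical throughout training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: n samples, k inputs, d hidden neurons (sizes arbitrary).
n, k, d = 100, 3, 4
X = rng.normal(size=(n, k))
y = rng.normal(size=(n, 1))

# Identical initialization: every hidden neuron starts with the same weights.
W = np.full((k, d), 0.1)     # hidden layer, all columns equal
W2 = np.full((d, 1), 0.1)    # output layer, all entries equal

lr = 0.01
for _ in range(200):
    h = X @ W                # identity activation
    err = h @ W2 - y         # n x 1 residuals
    # Mean-squared-error gradients via the chain rule.
    grad_W2 = h.T @ err / n
    grad_W = X.T @ (err @ W2.T) / n
    W2 -= lr * grad_W2
    W -= lr * grad_W

# All hidden neurons (columns of W) are still identical after training.
print(np.allclose(W[:, 0:1], W))   # True
```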
Random initializations of hidden layer:
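For comparison, here is the same toy setup as above but with random initialization of both weight matrices (again a sketch with arbitrary sizes and hyperparameters). The symmetry is broken from the start, and the neurons end up with different weights:

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, d = 100, 3, 4
X = rng.normal(size=(n, k))
y = rng.normal(size=(n, 1))

# Random initialization breaks the symmetry between hidden neurons.
W = rng.normal(scale=0.1, size=(k, d))
W2 = rng.normal(scale=0.1, size=(d, 1))

lr = 0.01
for _ in range(200):
    h = X @ W                # identity activation
    err = h @ W2 - y         # n x 1 residuals
    grad_W2 = h.T @ err / n
    grad_W = X.T @ (err @ W2.T) / n
    W2 -= lr * grad_W2
    W -= lr * grad_W

# The columns of W now differ: each neuron has learned different weights.
print(np.allclose(W[:, 0:1], W))   # False
```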