I would like to understand why in a single-layer network, given that all neurons use the same training examples (all of them), the gradient descent method does not reach the same solution for the weights of each neuron. In other words, if the equations are identical for the activation function of each neuron, why does the solution converge to different weights for each neuron?
Why doesn't optimization result in the same weights for all neurons in the same layer?

87 views. Asked on 2026-03-26 by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail).

There are 2 answers below.
If you started with all weights set to zero, say, then yes, they should all update exactly the same way and all the weights on a given layer would be the same. In that case, the expressive power of the neural network would be very limited. To break this symmetry, the weights of the network are always initialized randomly before training begins.
Previous answers/comments have pointed out that this is precisely the reason why random initialization is done in practice. To elaborate a bit further:
Suppose we have $n$ samples with inputs $\mathbf{X} \in \mathbb{R}^{n \times k}$, and a single hidden layer with $d$ neurons, with outputs $\mathbf{Y} \in \mathbb{R}^{n \times 1}$.
Then the neural network $f$ performs the mapping $$ f(\mathbf{X}) = \sigma(\sigma(\mathbf{XW + b})\mathbf{W}_2 + \mathbf{b}_2) $$ where $\mathbf{W} \in \mathbb{R}^{k \times d}$ and $\mathbf{W}_2 \in \mathbb{R}^{d \times 1}$ are the weights, $\mathbf{b}, \mathbf{b}_2$ are the biases, and $\sigma(\cdot)$ is the activation function.
For simplicity, let's suppose we have identity activation functions $\sigma(t) = t$, no bias terms $\mathbf{b} \equiv 0 \equiv \mathbf{b}_2$, and we are training via mean-squared-error loss.
In the backpropagation step, the gradient of the single-example loss w.r.t. $\mathbf{W}_2$ is $$ \nabla_{\mathbf{W}_2} \, (y_i - x_i'\mathbf{W}\mathbf{W}_2)^2 = 2(x_i'\mathbf{W})'(x_i'\mathbf{W}\mathbf{W}_2 - y_i), $$ and the gradient w.r.t. $\mathbf{W}$ follows analogously by the chain rule. If $\mathbf{W}$ and $\mathbf{W}_2$ are initialized so that every neuron has identical weights (not necessarily 0), the gradient w.r.t. each neuron's weights is the same. Every gradient step then preserves this symmetry, and we end up with the same weights for each neuron.
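This symmetry can be checked numerically. The following NumPy sketch (sizes, initial values, and target are made up for illustration) computes the single-example gradients above under identical initialization and verifies that every neuron receives the same update:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 3, 4
x = rng.normal(size=(k, 1))    # a single input x_i
y_i = 0.5                      # a single target (arbitrary)

# Identical initialization: every column of W (one column per hidden neuron)
# and every entry of W2 are equal. The constants 0.2 and 0.3 are arbitrary.
W = np.full((k, d), 0.2)
W2 = np.full((d, 1), 0.3)

# Gradients of (x_i' W W2 - y_i)^2 with identity activations and no biases.
resid = float(x.T @ W @ W2) - y_i
grad_W2 = 2 * resid * (W.T @ x)    # d x 1
grad_W = 2 * resid * (x @ W2.T)    # k x d, via the chain rule

# Every hidden neuron receives the same gradient, so a gradient step
# cannot break the symmetry.
print(np.allclose(grad_W[:, 0:1], grad_W))   # True
print(np.allclose(grad_W2, grad_W2[0, 0]))   # True
```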
However, if the weights are randomly initialized to different values, both the starting point and every subsequent gradient step differ across neurons, and we end up with different weights for each neuron.
Here is some simple code to play around with to build some intuition.
Identical initializations of hidden layer:
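The original snippet is not reproduced here; below is a minimal NumPy sketch of such an experiment (identity activations, no biases, MSE loss; the sizes, learning rate, and step count are illustrative). With every hidden neuron initialized identically, the columns of $\mathbf{W}$ remain identical throughout training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: n samples, k inputs, d hidden neurons (sizes arbitrary).
n, k, d = 100, 3, 4
X = rng.normal(size=(n, k))
y = rng.normal(size=(n, 1))

# Identical initialization: every hidden neuron starts with the same weights.
W = np.full((k, d), 0.1)     # hidden layer, all columns equal
W2 = np.full((d, 1), 0.1)    # output layer, all entries equal

lr = 0.01
for _ in range(200):
    h = X @ W                # identity activation
    err = h @ W2 - y         # n x 1 residuals
    # Mean-squared-error gradients via the chain rule.
    grad_W2 = h.T @ err / n
    grad_W = X.T @ (err @ W2.T) / n
    W2 -= lr * grad_W2
    W -= lr * grad_W

# All hidden neurons (columns of W) are still identical after training.
print(np.allclose(W[:, 0:1], W))   # True
```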
Random initializations of hidden layer:
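For comparison, here is the same toy setup as above but with random initialization of both weight matrices (again a sketch with arbitrary sizes and hyperparameters). The symmetry is broken from the start, and the neurons end up with different weights:

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, d = 100, 3, 4
X = rng.normal(size=(n, k))
y = rng.normal(size=(n, 1))

# Random initialization breaks the symmetry between hidden neurons.
W = rng.normal(scale=0.1, size=(k, d))
W2 = rng.normal(scale=0.1, size=(d, 1))

lr = 0.01
for _ in range(200):
    h = X @ W                # identity activation
    err = h @ W2 - y         # n x 1 residuals
    grad_W2 = h.T @ err / n
    grad_W = X.T @ (err @ W2.T) / n
    W2 -= lr * grad_W2
    W -= lr * grad_W

# The columns of W now differ: each neuron has learned different weights.
print(np.allclose(W[:, 0:1], W))   # False
```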