Stochastic Gradient Descent: Batch Size Effects


I am trying to understand stochastic gradient descent applied to least-squares residuals. Let's say we are trying to minimize $\|Ax - b\|^2_2$. Now we need to decide on a batch size and epoch length. Batch size is the number of rows of $A$ we use per epoch. So for mini-batch size 1, for each epoch, we want to solve $a_i^T x = b_i$.

However, I am really confused about this part. What I understand intuitively is that the data vectors are the rows of $A$, and we want good $x$'s (weights) that "move" those $a$'s as close as possible to the $b$'s. When we decide on a batch size, for example 1, do we mean $A$ is just a row vector with only 1 data point for each epoch, picked randomly? So $a$ is just a random row of $A$? And what happens when we increase the batch size? Let's say our batch size is 2: is $a$ now a $2 \times n$ matrix, because in each epoch we need to deal with 2 data vectors? And how does $x$ change when the batch size changes? I really don't understand the effects of setting a batch size mathematically; I want to develop it further than mere intuition. Thanks a lot.

BEST ANSWER

I think you have understood it wrong. SGD works the following way:

Let's say you have $m$ examples and the loss is MSE, $$L(w) = \sum_{i=1}^m (w^T x_i + b - y_i)^2 \quad \text{where } w, b \text{ are the parameters to be learnt.}$$ Let's see what the gradient looks like w.r.t. $w$: \begin{align} \nabla_w L = 2\sum_{i=1}^m (w^T x_i + b - y_i)\, x_i. \end{align} We see that when $m$ is large, we have to sum over a lot of terms just to compute the gradient, and remember we have to recompute it in each iteration of gradient descent. To make it faster, we instead compute a stochastic gradient: $$\tilde\nabla_w L = \frac{2m}{B}\sum_{i=1}^B (w^T \tilde x_i + b - \tilde y_i)\, \tilde x_i,$$ where $\tilde x_i = x_j$ and $\tilde y_i = y_j$ for a uniformly random index $j$, and $B$ is the batch size. The randomness (together with the $m/B$ rescaling) is there to make sure the stochastic gradient is unbiased, i.e. $E[\tilde\nabla_w L] = \nabla_w L$. Other than this, you proceed exactly like gradient descent.
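To make this concrete for the $\|Ax - b\|_2^2$ problem in the question, here is a minimal numpy sketch of mini-batch SGD. The data, step size, and iteration count are illustrative assumptions; the key point is that each step samples $B$ rows of $A$ (a $B \times n$ matrix) and rescales the batch gradient by $m/B$ so that it is unbiased for the full gradient $2A^T(Ax - b)$.

```python
import numpy as np

# Synthetic problem (assumed for illustration): b = A x_true, so the
# minimum of ||Ax - b||^2 is exactly zero.
rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true

def sgd_least_squares(A, b, batch_size=2, lr=1e-4, iters=5000, seed=1):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        idx = rng.integers(0, m, size=batch_size)  # B random row indices
        A_B, b_B = A[idx], b[idx]                  # the batch: a B x n matrix
        # Unbiased stochastic gradient: (2m/B) * A_B^T (A_B x - b_B),
        # whose expectation is the full gradient 2 A^T (A x - b).
        grad = (2 * m / batch_size) * A_B.T @ (A_B @ x - b_B)
        x -= lr * grad
    return x

x_hat = sgd_least_squares(A, b)
print(np.linalg.norm(A @ x_hat - b))  # residual norm, should be small
```

Note that $x$ itself has the same shape for every batch size; only the slice of $A$ used per step changes. Larger $B$ lowers the variance of each gradient estimate at the cost of more work per step.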

Note: strictly speaking, one *iteration* of SGD uses one batch, while an *epoch* usually means one full pass through the data (roughly $m/B$ iterations). Some sophisticated methods like SVRG compute a full gradient once per epoch (or once every few epochs) in order to reduce the variance of the stochastic gradient.
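The variance-reduction idea behind SVRG can be sketched as follows for the same least-squares loss. This is a simplified illustration, not a tuned implementation; the problem data, step size, and epoch counts are assumptions. The full gradient is computed once per epoch at a "snapshot" point, and each inner step corrects the single-example gradient using that snapshot.

```python
import numpy as np

# Same kind of synthetic problem as in the question (assumed).
rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)

def svrg_least_squares(A, b, n_epochs=30, inner=400, lr=5e-3, seed=1):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x_snap = np.zeros(n)  # snapshot point
    for _ in range(n_epochs):
        # Full gradient of f(x) = (1/m) sum_j (a_j^T x - b_j)^2,
        # computed once per epoch at the snapshot.
        full_grad = 2 * A.T @ (A @ x_snap - b) / m
        x = x_snap.copy()
        for _ in range(inner):
            j = rng.integers(0, m)
            a = A[j]
            g = 2 * a * (a @ x - b[j])            # gradient of f_j at x
            g_snap = 2 * a * (a @ x_snap - b[j])  # gradient of f_j at snapshot
            # Variance-reduced update: unbiased, and its variance shrinks
            # as x approaches the snapshot and the snapshot approaches x*.
            x -= lr * (g - g_snap + full_grad)
        x_snap = x
    return x_snap

x_sol = svrg_least_squares(A, b)
print(np.linalg.norm(A @ x_sol - b))
```

The design point: $g - g_{\text{snap}} + \text{full\_grad}$ still has expectation $\nabla f(x)$, but its variance vanishes as both $x$ and the snapshot approach the optimum, which is what lets SVRG use a constant step size.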