Implementing bias units in machine learning algorithms


I have implemented a few fully connected nets. I usually add a column of bias units (1s) to the input matrix and an extra row of weights to the weight matrix, because that's how I learned to implement neural nets in an online course. But in many implementations on GitHub I have found that, instead of inserting bias units into the matrices, the bias can be added separately: XW + b, where b is the bias.

I don't understand how that works. It seems like a cleaner and more efficient implementation, but I can't see why it's equivalent. For instance, consider the following example:

        1 2 3 4        0.5 0.5
   X =  1 4 5 6    W =  2   3      X*W = [3x2 matrix]
        1 8 9 5         5   8
                        2   3

The first column of X is the bias unit, and so is the first row of W.

But if the same thing is written without inserting the bias column, with the bias instead added separately, it becomes:

        2 3 4        2 3
   X =  4 5 6    W = 5 8    b = 0.5 0.5    X*W = [3x2 matrix]
        8 9 5        2 3

It seems clear that X*W + b from the second expression cannot equal the first expression. Furthermore, b, a 1x2 matrix, cannot be added to X*W, which is a 3x2 matrix, by ordinary matrix addition.

So, how can I implement biases using the second method?

Best answer:

Let's just carry your first case to its logical conclusion. Let $\hat{X}\in\mathbb{R}^{|D|\times(m+1)}$, $\hat{W}\in\mathbb{R}^{(1+m)\times n}$, ${X}\in\mathbb{R}^{|D|\times m}$, ${W}\in\mathbb{R}^{m\times n}$, $\vec{1}\in\mathbb{R}^{|D|\times 1}$ be the vector of ones, $W_i$ be the $i$th column of $W$, and $b\in\mathbb{R}^{1\times n}$. Then your case 1 looks like: $$ y = \hat{X}\hat{W} = \left[\vec{1}\; X\right]\begin{bmatrix}b\\ W\end{bmatrix}= \left[b_1\vec{1}+XW_1,\ldots,b_n\vec{1}+XW_n\right]=\vec{1}b+XW=B+XW $$ where $B=\vec{1}b=\vec{1}\otimes b\in\mathbb{R}^{|D|\times n}$. (For your example, $|D|=3,m=3,n=2$).

So the bias equivalent to your "first case" is the rank-1 matrix $B$, not a vector. In practice (e.g. in NumPy), writing `X*W + b` with a 1×n (or length-n) `b` still works, because broadcasting implicitly expands $b$ to $B$ by repeating it down the rows.
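A quick numerical check of the identity above, using the matrices from the question (the `X_hat`/`W_hat` names are mine, not from any particular library):

```python
import numpy as np

# The question's first form: X with a leading column of ones,
# W with the bias row stacked on top.
X_hat = np.array([[1., 2., 3., 4.],
                  [1., 4., 5., 6.],
                  [1., 8., 9., 5.]])
W_hat = np.array([[0.5, 0.5],
                  [2.,  3.],
                  [5.,  8.],
                  [2.,  3.]])

# The second form: same data, bias kept separate.
X = X_hat[:, 1:]   # drop the column of ones
W = W_hat[1:, :]   # drop the bias row
b = W_hat[0:1, :]  # the 1x2 bias row

# Broadcasting replicates b down the rows, i.e. it forms B = 1 (x) b implicitly.
out_augmented = X_hat @ W_hat   # [3x2]
out_separate  = X @ W + b       # [3x2], identical

print(np.allclose(out_augmented, out_separate))  # True
```

Both forms compute the same 3x2 output; the second just avoids materializing the column of ones.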


Personally, I'd write it like this. For a single artificial neuron with weight vector $w=\langle w_1,\ldots,w_n\rangle$, let $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ be the sigmoid function (or e.g. $\tanh$; it doesn't matter), let $x_i$ be an input, and let $b$ be a scalar bias.

Then the output of the node on input $i$ is: $$ y_i = \sigma(w\cdot x_i + b) = \sigma\left(b+\sum_{k=1}^n w_kx_{ik} \right) =\sigma\left(w_0\cdot 1+w_1x_{i1}+\ldots+w_nx_{in} \right)=\sigma(\tilde{w}\cdot\tilde{x}_i) $$ where we renamed $b=w_0$, and let $\tilde{w}=[w_0\;\, w]$, $\tilde{x}_i=[1\;\,x_i]$ by concatenation.
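The same renaming trick, sketched for one neuron (the `sigmoid` helper and the sample numbers are illustrative, not from the question):

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2., 5., 2.])  # weight vector
b = 0.5                     # scalar bias
x = np.array([2., 3., 4.])  # one input vector x_i

# Separate-bias form: sigma(w . x + b)
y1 = sigmoid(w @ x + b)

# Augmented form: rename b = w_0 and prepend a 1 to the input.
w_tilde = np.concatenate(([b], w))    # [w_0, w_1, ..., w_n]
x_tilde = np.concatenate(([1.], x))   # [1, x_1, ..., x_n]
y2 = sigmoid(w_tilde @ x_tilde)

print(np.isclose(y1, y2))  # True
```

Both give the same activation; which bookkeeping you prefer is purely a matter of implementation convenience.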