I have implemented a few fully connected nets. I usually add a column of bias units (1s) to the input matrix and an extra row of weights to the weight matrix, because that's how I learned to implement neural nets in an online course. But in many implementations on GitHub I have found that, instead of inserting bias units into the matrix, the bias can be added separately: XW + b, where b is the bias.
I don't understand how that works. It seems like a better and more efficient implementation, but I don't see why it's equivalent. For instance, consider the following example:
$$X = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 1 & 4 & 5 & 6 \\ 1 & 8 & 9 & 5 \end{bmatrix}, \quad W = \begin{bmatrix} 0.5 & 0.5 \\ 2 & 3 \\ 5 & 8 \\ 2 & 3 \end{bmatrix}, \quad XW = [3\times 2 \text{ matrix}]$$
The first column of $X$ is the bias unit, and so is the first row of $W$.
But if the same is written without inserting the bias column directly, adding it separately instead, it becomes:
$$X = \begin{bmatrix} 2 & 3 & 4 \\ 4 & 5 & 6 \\ 8 & 9 & 5 \end{bmatrix}, \quad W = \begin{bmatrix} 2 & 3 \\ 5 & 8 \\ 2 & 3 \end{bmatrix}, \quad b = \begin{bmatrix} 0.5 & 0.5 \end{bmatrix}, \quad XW = [3\times 2 \text{ matrix}]$$
It seems clear that $XW + b$ from the second expression is not equal to the first expression. Furthermore, $b$, a $1\times 2$ matrix, cannot even be added to $XW$, which is a $3\times 2$ matrix.
So, how can I implement biases using the second method?
Let's just carry your first case to its logical conclusion. Let $\hat{X}\in\mathbb{R}^{|D|\times(m+1)}$, $\hat{W}\in\mathbb{R}^{(1+m)\times n}$, ${X}\in\mathbb{R}^{|D|\times m}$, ${W}\in\mathbb{R}^{m\times n}$, $\vec{1}\in\mathbb{R}^{|D|\times 1}$ be the vector of ones, $W_i$ be the $i$th column of $W$, and $b\in\mathbb{R}^{1\times n}$. Then your case 1 looks like: $$ y = \hat{X}\hat{W} = \left[\vec{1}\; X\right]\begin{bmatrix}b\\ W\end{bmatrix}= \left[b_1\vec{1}+XW_1,\ldots,b_n\vec{1}+XW_n\right]=\vec{1}b+XW=B+XW $$ where $B=\vec{1}b=\vec{1}\otimes b\in\mathbb{R}^{|D|\times n}$. (For your example, $|D|=3,m=3,n=2$).
So the bias equivalent to your "first case" is the rank-1 matrix $B$, not a vector. In practice, when code writes $XW + b$, the row vector $b$ is broadcast (replicated) across all $|D|$ rows before the addition, which produces exactly $B = \vec{1}b$; that is why the $1\times n$ bias can be "added" to a $|D|\times n$ matrix.
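Here is a quick NumPy check of this equivalence, using the matrices from the question. NumPy's broadcasting rules replicate the row vector $b$ across the rows of $XW$, i.e. they implicitly form the rank-1 matrix $B = \vec{1}b$:

```python
import numpy as np

# Augmented form: bias column of ones in X, bias row in W
X_hat = np.array([[1., 2., 3., 4.],
                  [1., 4., 5., 6.],
                  [1., 8., 9., 5.]])
W_hat = np.array([[0.5, 0.5],
                  [2.,  3.],
                  [5.,  8.],
                  [2.,  3.]])

# Separate-bias form: plain X and W, bias b kept apart
X = X_hat[:, 1:]   # drop the column of ones
W = W_hat[1:, :]   # drop the bias row
b = W_hat[0, :]    # shape (1+n sliced to n,): the bias row

# Broadcasting adds b to every row of X @ W,
# which is exactly the rank-1 matrix B = 1 b
print(np.allclose(X_hat @ W_hat, X @ W + b))  # True
```

Both expressions yield the same $3\times 2$ result; the broadcast addition is what makes the shape mismatch in the question a non-issue.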
Personally, I'd write it like this. For a single artificial neuron (with weight vector $w=\langle w_1,\ldots,w_n\rangle$), let $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ be the sigmoid function (or e.g. tanh; doesn't matter), input $x_i$, and scalar bias $b$.
Then the output of the node on input $i$ is: $$ y_i = \sigma(w\cdot x_i + b) = \sigma\left(b+\sum_{k=1}^n w_kx_{ik} \right) =\sigma\left(w_0\cdot 1+w_1x_{i1}+\ldots+w_nx_{in} \right)=\sigma(\tilde{w}\cdot\tilde{x}_i) $$ where we renamed $b=w_0$, and let $\tilde{w}=[w_0\;\, w]$, $\tilde{x}_i=[1\;\,x_i]$ by concatenation.
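As a sketch of that single-neuron identity (assuming NumPy and the logistic sigmoid; the weight and input values below are just illustrative), the separate-bias form and the augmented form agree:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.2, -0.5, 0.9])   # weight vector (illustrative values)
b = 0.1                          # scalar bias
x = np.array([1.5, 2.0, -0.3])   # one input vector

# Separate-bias form: sigma(w . x + b)
y1 = sigmoid(w @ x + b)

# Augmented form: prepend 1 to x and b (renamed w_0) to w
w_tilde = np.concatenate(([b], w))
x_tilde = np.concatenate(([1.0], x))
y2 = sigmoid(w_tilde @ x_tilde)

print(np.isclose(y1, y2))  # True
```

The two computations differ only in bookkeeping: the augmented form folds the bias into one dot product, the separate form keeps it as an explicit scalar addition.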