I was studying neural networks and ran into a problem. Normally we write the weights as $w_{ij}^{(l)}$, where $i$ is the index of the receiving node in layer $l+1$ and $j$ is the index of the sending node in layer $l$.
Thus, for a NN with $3$ inputs and $4$ nodes in the first hidden layer, the weight matrix is
$W^{(1)} = \begin{bmatrix} w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)} \\ w_{31}^{(1)} & w_{32}^{(1)} & w_{33}^{(1)} \\ w_{41}^{(1)} & w_{42}^{(1)} & w_{43}^{(1)} \end{bmatrix}$
Now this is multiplied by the input.
HERE IS THE PROBLEM
Normally the input is represented with the features in the columns and the samples in the rows; for instance,
$X = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \\ x_{41} & x_{42} & x_{43} \\ x_{51} & x_{52} & x_{53} \end{bmatrix}$
would tell us that we have $3$ features and $5$ samples.
Normally we always see the formula $WX + \mathbf{b}$; however, this would mean
$\begin{bmatrix} w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)} \\ w_{31}^{(1)} & w_{32}^{(1)} & w_{33}^{(1)} \\ w_{41}^{(1)} & w_{42}^{(1)} & w_{43}^{(1)} \end{bmatrix} \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \\ x_{41} & x_{42} & x_{43} \\ x_{51} & x_{52} & x_{53} \end{bmatrix}$
which clearly doesn't work: $W$ is $4 \times 3$ and $X$ is $5 \times 3$, so the inner dimensions don't match. Thus we should actually have $WX^T + b$, right?
$\begin{bmatrix} w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)} \\ w_{31}^{(1)} & w_{32}^{(1)} & w_{33}^{(1)} \\ w_{41}^{(1)} & w_{42}^{(1)} & w_{43}^{(1)} \end{bmatrix} \begin{bmatrix} x_{11} & x_{21} & x_{31} & x_{41} & x_{51} \\ x_{12} & x_{22} & x_{32} & x_{42} & x_{52} \\ x_{13} & x_{23} & x_{33} & x_{43} & x_{53} \end{bmatrix}$
which gives us the correct answer (I think): a $4 \times 5$ matrix with one column per sample. What is going on??
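To make the shape issue concrete, here is a quick NumPy check (the values are arbitrary; only the shapes matter):

```python
import numpy as np

# Shapes from the question: W is (4, 3) -- 4 hidden nodes, 3 features;
# X is (5, 3) -- 5 samples in the rows, 3 features in the columns.
W = np.arange(12, dtype=float).reshape(4, 3)
X = np.arange(15, dtype=float).reshape(5, 3)
b = np.zeros((4, 1))  # one bias per hidden node

# W @ X fails: (4, 3) @ (5, 3) -- inner dimensions 3 and 5 don't match.
try:
    W @ X
except ValueError as err:
    print("W @ X fails:", err)

# W @ X.T works: (4, 3) @ (3, 5) -> (4, 5), one column per sample.
Z = W @ X.T + b
print(Z.shape)  # -> (4, 5)
```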
In a neural network's activation formula, each neuron computes the dot product of its weights with its inputs. The transpose appears only because you laid out the $X$ matrix the other way around: you put the samples in the rows and the features in the columns. If you build $X$ the other way (features in rows, samples in columns), no transpose is needed. This is not a mistake; both conventions are in common use. Whether the formula reads $WX + b$ or $WX^T + b$ simply depends on how $X$ is laid out.
Remember that you first have to compute the weighted sum for each neuron in the hidden layer, or you cannot activate it. Using the question's notation, the first entry of the product $WX^T$ is
$h_{11} = w_{11} x_{11} + w_{12} x_{12} + w_{13} x_{13}$
where in $h_{ij}$, $i$ is the node number in the hidden layer and $j$ is the sample number.
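You can verify that entry against the matrix product directly; a small NumPy check with made-up numbers:

```python
import numpy as np

W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])  # (4 hidden nodes, 3 features)
X = np.array([[0.5, 1.0, 1.5],      # sample 1 (a row of X)
              [2.0, 2.5, 3.0]])     # sample 2

H = W @ X.T  # (4, 2): rows index hidden nodes, columns index samples

# h_11 computed by hand, exactly as in the formula above
h11 = W[0, 0] * X[0, 0] + W[0, 1] * X[0, 1] + W[0, 2] * X[0, 2]
print(h11, H[0, 0])  # -> 7.0 7.0
```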
[Figure: a simple representation of your network.]
Some time ago I used TensorFlow to solve the XOR problem; there, too, the input had the features in the columns and the samples in the rows.
In the output matrix (the matrix of hidden sums, sumH), each row is a neuron and each column a sample. Hope this helps; if you need some clarification, write a comment. Best regards, Marco.
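For completeness, here is a small NumPy sketch (with made-up values) showing that the two conventions carry exactly the same numbers, one result being the transpose of the other:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # (hidden nodes, features)
X = rng.normal(size=(5, 3))  # (samples, features) -- the framework layout
b = rng.normal(size=(4,))    # one bias per hidden node

# "Math" convention: columns of X^T are samples -> (4, 5),
# rows index neurons, columns index samples.
math_layout = W @ X.T + b[:, None]

# "Framework" convention: rows of X are samples -> (5, 4),
# rows index samples, columns index neurons.
fw_layout = X @ W.T + b

# Same numbers, transposed layout.
print(np.allclose(math_layout.T, fw_layout))  # -> True
```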