Below is the least squares objective function in vector form. Assume $x$ is $m \times n$, $W$ is $n \times 1$, and $y$ is $m \times 1$, so $J$ is $1 \times 1$, i.e. a scalar. The superscript $T$ denotes the transpose.
$$ J=\frac{1}{m}(xW -y)^{T}(xW-y) $$
Let $A = xW - y$. Then $$ J=\frac{1}{m}A^{T}A, $$ $$ \frac{dJ}{dA}=\frac{2}{m}A, $$ $$ \frac{dA}{dW}=x. $$
By chain rule:
$$ \frac{dJ}{dW}=\frac{dJ}{dA}\frac{dA}{dW}=\frac{2}{m}Ax. $$
Obviously, if we do this the dimensions don't match up, since $A$ is $m \times 1$ and $x$ is $m \times n$. Instead, the answer should be $$\frac{dJ}{dW}=\frac{2}{m}x^{T}A. $$
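For what it's worth, a quick finite-difference check confirms this corrected formula (a minimal numpy sketch; the variable names are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
x = rng.standard_normal((m, n))   # m x n
W = rng.standard_normal((n, 1))   # n x 1
y = rng.standard_normal((m, 1))   # m x 1

def J(W):
    A = x @ W - y
    return (A.T @ A).item() / m   # 1 x 1 -> scalar

# Claimed gradient: (2/m) x^T (xW - y), shape n x 1 (same as W).
grad_claimed = (2.0 / m) * x.T @ (x @ W - y)

# Central finite differences, one coordinate of W at a time.
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(n):
    e = np.zeros_like(W)
    e[i, 0] = eps
    grad_numeric[i, 0] = (J(W + e) - J(W - e)) / (2 * eps)

print(np.allclose(grad_claimed, grad_numeric, atol=1e-6))  # True
```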
My question is: how do we know the convention in this case? Can someone show me how the chain rule should be applied here?
For differential calculus involving matrices it is more convenient to use differentials instead of derivatives. You can find the rules for differentials in *Practical Guide to Matrix Calculus for Deep Learning* by Andrew Delong or in the book *Matrix Algebra* by Abadir & Magnus.
I will show you how to get the derivative of $J$ using the notation and rules from the paper by Delong. Delong's paper doesn't include the chain rule for differentials, but it is $$d(f\circ g)(x, dx)=df(g(x),dg(x,dx)).$$ You can look at Composite function gradient for a proof sketch.
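For intuition, in the scalar case this reduces to the ordinary chain rule. For example, with $f(u)=u^2$ and $g(x)=\sin x$,
$$d(f\circ g)(x,dx)=df(g(x),dg(x,dx))=2g(x)\,dg(x,dx)=2\sin x\,\cos x\,dx,$$
which matches $\frac{d}{dx}\sin^2 x=2\sin x\cos x$.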
So, using rules (10) and (12) from the paper,
$$dJ(A,dA)=d\left(\frac{1}{m}A^TA\right)=\frac{1}{m}(dA)^TA+\frac{1}{m}A^T\,dA=\frac{2}{m}A^T\,dA,$$
where the last equality holds because $(dA)^TA$ is a scalar and therefore equals its own transpose $A^T\,dA$. Likewise,
$$dA(x,dx)=d(xW-y)=(dx)W,$$
$$dA(W,dW)=d(xW-y)=x\,dW.$$
By the chain rule, rule (6), and the fact that $v^Tu=v\cdot u$ for vectors $v,u$,
$$dJ(x,dx)=dJ(A(x),dA(x,dx))=\frac{2}{m}A(x)^T\,dA(x,dx)=\frac{2}{m}(xW-y)^T(dx)W$$
$$=\frac{2}{m}(xW-y)\cdot (dx)W=\frac{2}{m}(xW-y)W^T\cdot dx,$$
$$dJ(W,dW)=dJ(A(W),dA(W,dW))=\frac{2}{m}(xW-y)\cdot x\,dW=\frac{2}{m}x^T(xW-y)\cdot dW.$$
From these and rule (17),
$$\frac{\partial J}{\partial x}=\frac{2}{m}(xW-y)W^T,\qquad \frac{\partial J}{\partial W}=\frac{2}{m}x^T(xW-y).$$
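As a sanity check (a minimal numpy sketch, not from the paper; all names are illustrative), the formula for $\frac{\partial J}{\partial x}$ can be verified against central finite differences; the check for $\frac{\partial J}{\partial W}$ is shown in the question above and is analogous:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
x = rng.standard_normal((m, n))
W = rng.standard_normal((n, 1))
y = rng.standard_normal((m, 1))

def J(x):
    A = x @ W - y
    return (A.T @ A).item() / m   # scalar objective

# Claimed gradient with respect to the matrix x, shape m x n (same as x).
grad_claimed = (2.0 / m) * (x @ W - y) @ W.T

# Central differences, one entry of x at a time.
eps = 1e-6
grad_numeric = np.zeros_like(x)
for i in range(m):
    for j in range(n):
        e = np.zeros_like(x)
        e[i, j] = eps
        grad_numeric[i, j] = (J(x + e) - J(x - e)) / (2 * eps)

print(np.allclose(grad_claimed, grad_numeric, atol=1e-6))  # True
```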