Given $g:R^n \rightarrow R^k$ and $h:R^k \rightarrow R$, we have $f(x) = h(g(x))$.
Using the chain rule, we can differentiate $f(x)$ to get
$f'(x) = \nabla^Th(g(x))g'(x)$
My question is why do we take the transpose of the gradient of $h$? Is it just to make sure the result is a scalar, since $f(x)$ is in $R$?
If so, does it mean that every time we do vector differentiation, we need to check that the output dimensions match and take a transpose if necessary (i.e., there is no hard-and-fast rule about when to transpose)?
First, the result isn't a scalar, it's a (row) vector. Second, the notation is sloppy. The usual formula for the chain rule is $$D(h\circ g)(x) = Dh(g(x))Dg(x)$$ where the product on the RHS is the matrix product. In your case $\nabla^T h(g(x))$ (i.e., $Dh(g(x))$) is a row vector. See https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant.
EDIT: example with $n = 3$, $k = 2$: $$ \pmatrix{\partial_1 f&\partial_2 f&\partial_3 f} = \pmatrix{\partial_1 h&\partial_2 h} \pmatrix{\partial_1 g_1&\partial_2 g_1&\partial_3 g_1\cr\partial_1 g_2&\partial_2 g_2&\partial_3 g_2}. $$
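If it helps, here is a quick numeric sanity check of $D(h\circ g)(x) = Dh(g(x))Dg(x)$ for $n = 3$, $k = 2$. The particular $g$ and $h$ below are just made-up examples (not from the question); the point is that the matrix product of the $1\times 2$ row vector $Dh(g(x))$ with the $2\times 3$ Jacobian $Dg(x)$ agrees with the finite-difference gradient of $f = h\circ g$, and both are $1\times 3$ row vectors.

```python
import numpy as np

def g(x):                      # g : R^3 -> R^2 (example choice)
    return np.array([x[0] * x[1], np.sin(x[2])])

def h(y):                      # h : R^2 -> R (example choice)
    return y[0] ** 2 + 3.0 * y[1]

def grad_h(y):                 # Dh(y): a 1x2 row vector
    return np.array([[2.0 * y[0], 3.0]])

def jac_g(x):                  # Dg(x): a 2x3 Jacobian matrix
    return np.array([[x[1], x[0], 0.0],
                     [0.0, 0.0, np.cos(x[2])]])

x = np.array([1.0, 2.0, 0.5])

# RHS of the chain rule: the matrix product Dh(g(x)) Dg(x), shape (1, 3).
rhs = grad_h(g(x)) @ jac_g(x)

# LHS: central finite differences of f = h o g, also shape (1, 3).
f = lambda x: h(g(x))
eps = 1e-6
lhs = np.array([[(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)]])

print(rhs.shape)                          # (1, 3): a row vector, not a scalar
print(np.allclose(lhs, rhs, atol=1e-5))  # True
```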