I am trying to understand the layout conventions used in Matrix calculus as described on Wikipedia. For this question I want to assume numerator layout, with a "standard" vector being in column form. So let $f:\mathbb{R}^n \to \mathbb{R}$ be twice differentiable and $\mathbf{x} \in \mathbb{R}^n$ a column vector variable; $f(\mathbf{x})$ is a scalar and hence trivially in column form.
By numerator layout we have for the derivative the row vector $$f'(\mathbf{x}) =\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial f(\mathbf{x})}{\partial x_n} \end{bmatrix}.$$ Now I want to take a second derivative. My problem is that no distinction is ever made between differentiating column vectors and differentiating row vectors. Moreover, I have never seen a definition of $\frac{\partial^2 f(\mathbf{x})}{\partial \mathbf{x}^2}$, even though I see it used all the time in numerical analysis.
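As a quick numerical sanity check (the function `f` below is a made-up example, not from the question), the numerator-layout derivative of a scalar function is the $1 \times n$ row vector of partials, which can be approximated by central differences:

```python
# Hedged sketch: f is a hypothetical example function.
def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1]

def derivative_row(f, x, h=1e-6):
    """Central-difference approximation of the numerator-layout derivative,
    returned as a row vector (a flat list of length n)."""
    row = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        row.append((f(xp) - f(xm)) / (2 * h))
    return row

# Analytically f'(x) = [2*x1 + 3*x2, 3*x1], so at (1, 2) this is close to [8, 3].
print(derivative_row(f, [1.0, 2.0]))
```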
So I believe there are three possibilities:
The operator $\frac{\partial}{\partial \mathbf{x}}$ ...
- ... is indifferent to the format of $f'(\mathbf{x})$. We could think of this as implicitly reconverting $f'(\mathbf{x})$ to a column vector, i.e. $$f''(\mathbf{x}) = \frac{\partial f'(\mathbf{x})}{\partial \mathbf{x}} = \frac{\partial^2 f(\mathbf{x})}{\partial \mathbf{x}^2} =\begin{bmatrix} \frac{\partial^2 f(\mathbf{x})}{\partial x_1^2} & \cdots& \frac{\partial^2 f(\mathbf{x})}{\partial x_n\partial x_1} \\ \vdots & & \vdots\\ \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_n} & \cdots& \frac{\partial^2 f(\mathbf{x})}{\partial x_n^2} \end{bmatrix}.$$ So we get the transposed Hessian of $f(\mathbf{x})$.
- ... is only defined on column vectors, so $\frac{\partial f'(\mathbf{x})}{\partial \mathbf{x}}$ is nonsensical and we should instead write $$f''(\mathbf{x}) = \frac{\partial}{\partial \mathbf{x}}\left(\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right)^\top =\begin{bmatrix} \frac{\partial^2 f(\mathbf{x})}{\partial x_1^2} & \cdots& \frac{\partial^2 f(\mathbf{x})}{\partial x_n\partial x_1} \\ \vdots & & \vdots\\ \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_n} & \cdots& \frac{\partial^2 f(\mathbf{x})}{\partial x_n^2} \end{bmatrix},$$ again obtaining the transposed Hessian. On Wikipedia I found the notation $\frac{\partial^2 f(\mathbf{x})}{\partial \mathbf{x} \partial \mathbf{x}^\top}$, which might refer to this, but I am not sure.
- ... takes numerator layout literally also for row vectors (or, equivalently, allows exchanging transposition with differentiation): $$f''(\mathbf{x}) =\frac{\partial^2 f(\mathbf{x})}{\partial \mathbf{x}^2} =\frac{\partial f'(\mathbf{x})}{\partial \mathbf{x}} = \left(\frac{\partial f'(\mathbf{x})^\top}{\partial \mathbf{x}}\right)^\top =\begin{bmatrix} \frac{\partial^2 f(\mathbf{x})}{\partial x_1^2} & \cdots& \frac{\partial^2 f(\mathbf{x})}{\partial x_n\partial x_1} \\ \vdots & & \vdots\\ \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_n} & \cdots& \frac{\partial^2 f(\mathbf{x})}{\partial x_n^2} \end{bmatrix}^\top = \begin{bmatrix} \frac{\partial^2 f(\mathbf{x})}{\partial x_1^2} & \cdots& \frac{\partial^2 f(\mathbf{x})}{\partial x_1\partial x_n} \\ \vdots & & \vdots\\ \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_1} & \cdots& \frac{\partial^2 f(\mathbf{x})}{\partial x_n^2} \end{bmatrix} $$ So we get the actual Hessian.
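To make the three candidates concrete, here is a small finite-difference sketch (the example function is arbitrary). Since the example is $C^2$, its Hessian is symmetric and all three candidates coincide numerically, which is also why the orientation question cannot be settled by an experiment like this alone:

```python
# Illustrative only: f is an arbitrary C^2 example function, so the Hessian
# computed below comes out symmetric and "Hessian" vs "transposed Hessian" agree.
def f(x):
    return x[0] ** 3 + 2 * x[0] * x[1] + x[1] ** 2

def hessian(f, x, h=1e-4):
    """Central-difference approximation of the matrix with entries
    H[i][j] ~ second partial of f in directions i and j."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            xpp = list(x); xpp[i] += h; xpp[j] += h
            xpm = list(x); xpm[i] += h; xpm[j] -= h
            xmp = list(x); xmp[i] -= h; xmp[j] += h
            xmm = list(x); xmm[i] -= h; xmm[j] -= h
            H[i][j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * h * h)
    return H

H = hessian(f, [1.0, 2.0])          # analytically [[6, 2], [2, 2]] at (1, 2)
asym = max(abs(H[i][j] - H[j][i]) for i in range(2) for j in range(2))
print(H, asym)                      # asym is ~0 here
```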
Note that the orientation of the Hessian matters for my question, since in general a second-derivative matrix need not be symmetric, and the answer is also relevant for consecutive derivatives with respect to different vectors.
To summarize my question: How are the operators $\frac{\partial^2}{\partial \mathbf{x}^2}$ and $\frac{\partial^2}{\partial \mathbf{x} \partial \mathbf{x}^\top}$ defined and which of the above is the convention for row vectors, assuming numerator layout?
Honestly, the distinction between row and column vectors is more of a notational crutch, and it even starts to get in the way once you do matrix calculus with higher-order tensors.
A much better mental model is to think about vectors and dual vectors (bras and kets).
Now the derivative of a function $f:U→V$ at a point $u∈U$ is, by definition, first and foremost a linear map $Df(u):U→V,\; ∆u ↦ Df(u)(∆u)$.
It is only on second inspection that one realizes that linear maps between finite-dimensional vector spaces can be naturally encoded as tensors, i.e. we can identify $Df(u)$ with a tensor $T_{f,u}∈V⊗U^*$ such that $Df(u)(∆u) = T_{f,u}⋅∆u$ for all $∆u∈U$, where, writing $T_{f,u} = ∑_{i=1}^{r} |v_i⟩⟨u_i|$ as a sum of rank-one tensors in bra-ket notation,
$$T_{f, u}⋅∆u = \Big(∑_{i=1}^{r} |v_i⟩⟨u_i|\Big)|{∆u}⟩ = ∑_{i=1}^{r} ⟨u_i|{∆u}⟩\,|v_i⟩.$$
Notice that this formula is agnostic about what kind of finite dimensional vector spaces $U$ and $V$ are. They could be simply $ℝⁿ$ and $ℝᵐ$, or they could be higher order tensor-products or even direct sums of higher order tensor-products.
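For the concrete case $U=ℝ³$, $V=ℝ²$, here is a small sketch (with arbitrary illustrative vectors $u_i$, $v_i$) of how the sum of rank-one tensors $∑_i |v_i⟩⟨u_i|$ acts on $|∆u⟩$, matching the matrix-vector product with the assembled matrix:

```python
# Made-up example data: v_i in R^2, bras u_i in (R^3)^*, du in R^3.
def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, x):
    return sum(ui * xi for ui, xi in zip(u, x))

vs = [[1.0, 0.0], [2.0, 1.0]]
us = [[0.5, 1.0, 0.0], [1.0, 0.0, 2.0]]
du = [1.0, 2.0, 3.0]

# Assemble T = sum_i |v_i><u_i| as an explicit 2x3 matrix of outer products.
T = [[sum(vs[i][k] * us[i][j] for i in range(2)) for j in range(3)]
     for k in range(2)]

lhs = mat_vec(T, du)                              # (sum_i |v_i><u_i|) |du>
rhs = [sum(dot(us[i], du) * vs[i][k] for i in range(2))
       for k in range(2)]                         # sum_i <u_i|du> |v_i>
print(lhs, rhs)  # both [16.5, 7.0]
```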
Now, to get to the second derivative, we consider the function
$$ Df: U ⟶ V⊗U^*,\quad u ⟼ Df(u)$$
Then $D²f(u)$ is a linear map $U→V⊗U^*$, which can naturally be encoded in a tensor of the form $(V⊗U^*)⊗U^* ≅ V⊗(U^*)^{⊗2}$.
So the Hessian, which is what we call this second derivative in the special case $U=ℝⁿ$, $V=ℝ$, can be considered an element of $ℝ⊗((ℝⁿ)^*)^{⊗2}≅ℝ^{n×n}$. But note that the last isomorphism loses some meta-information: it forgets that both tensor factors of this matrix are dual vectors.
Now to get back to your question: classically, we identify column vectors with elements of $ℝⁿ$ and row vectors with elements of $(ℝⁿ)^*$. This makes sense when we talk about the derivative of a scalar-valued function: if $f:ℝⁿ→ℝ$, then $Df(x)$ can be identified with a tensor of the form $ℝ⊗(ℝⁿ)^*≅ (ℝⁿ)^*=$ row vector.
And a regular $m×n$ matrix, which usually models a linear map $ℝⁿ→ℝᵐ$, can be considered an element of $ℝᵐ⊗(ℝⁿ)^*$, so it makes sense to speak of its column and row vectors.
However, in the case of the Hessian this connotation breaks down, because both of its tensor factors are dual vectors. Really, the way we should use the Hessian, to make this clear, is to rewrite the second-order Taylor term as
$$ ½\,{∆x}^⊤ D²f(x)\, {∆x} = ½\,⟨D²f(x)∣ ∆x⊗∆x⟩_{ℝⁿ⊗ℝⁿ}$$
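A minimal numerical sketch of this identity (the matrix `H` below is arbitrary example data standing in for $D²f(x)$): the quadratic form $∆x^⊤ H\, ∆x$ equals the entrywise pairing of $H$ with the outer product $∆x⊗∆x$:

```python
# Arbitrary example data: H plays the role of the Hessian, dx of ∆x.
H = [[6.0, 2.0], [2.0, 2.0]]
dx = [0.3, -0.1]

# Left-hand side: the classical quadratic form ∆x^T H ∆x.
Hdx = [sum(H[i][j] * dx[j] for j in range(2)) for i in range(2)]
quad = sum(dx[i] * Hdx[i] for i in range(2))

# Right-hand side: pair H entrywise with the outer product ∆x ⊗ ∆x.
outer_dx = [[dx[i] * dx[j] for j in range(2)] for i in range(2)]
pairing = sum(H[i][j] * outer_dx[i][j] for i in range(2) for j in range(2))

print(quad, pairing)  # both 0.44 (up to rounding)
```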
In fact, the general Taylor expansion for a function $f:U→ℝ$ can be written as
$$ f(u+∆u) = ∑_{k=0}^r \tfrac{1}{k!}\, ⟨D^k f(u)∣∆u^{⊗k}⟩_{U^{⊗k}} + \mathcal{R}$$
Again, this formula is nicer than the usual one because it is agnostic about $U$: it applies without any vectorization when $U$ is more complicated than simply $ℝⁿ$, and we do not lose the meta-information that vectorization destroys.
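As a sketch of this pairing-based Taylor formula in the simplest case $U=ℝ²$ (with a hypothetical quadratic $f$, for which the expansion up to $k=2$ is exact and the remainder vanishes):

```python
# Hypothetical quadratic f on R^2; derivative and Hessian written out by hand.
def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1] + x[1] ** 2

u = [1.0, 2.0]
du = [0.5, -0.3]

grad = [2 * u[0] + 3 * u[1], 3 * u[0] + 2 * u[1]]  # Df(u), a dual vector
H = [[2.0, 3.0], [3.0, 2.0]]                        # D^2 f(u), constant for a quadratic

taylor = (f(u)
          + sum(grad[i] * du[i] for i in range(2))            # <Df(u) | ∆u>
          + 0.5 * sum(H[i][j] * du[i] * du[j]
                      for i in range(2) for j in range(2)))   # ½ <D²f(u) | ∆u⊗∆u>
print(taylor, f([u[0] + du[0], u[1] + du[1]]))  # equal: the expansion is exact here
```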
TL;DR: The question itself is somewhat ill-posed. We take derivatives of functions, not of vectors. Once you are clear about which function a vector represents, the shape of the tensor associated with the derivative becomes immediately clear.