I was reading the following derivation of the Kalman filter gain, http://www.robots.ox.ac.uk/~ian/Teaching/Estimation/LectureNotes2.pdf, on pages 5-6.
From the following
$$L=\min_{K_{k+1}} trace(P_{k+1|k+1})$$
With
$$ P_{k+1|k+1} = \boldsymbol{ (I-K_{k+1} H_{k+1}) P_{k+1|k} (I-K_{k+1} H_{k+1})^{T} + K_{k+1}R_{k+1}K_{k+1}^T }$$
$K$, $H$, and $P$ are all $n \times n$ matrices.
It uses the following trace differentiation identity:
$${\frac{\partial}{\partial \boldsymbol{A}}} (trace(\boldsymbol{ABA^T})) = 2\boldsymbol{AB}$$
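(Aside: the identity can be sanity-checked numerically. The following NumPy sketch is my own, not part of the notes; for a general $B$ the derivative is $A(B+B^T)$, which reduces to $2AB$ when $B$ is symmetric, so the code symmetrizes $B$ first.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
B = B + B.T  # symmetrize B so the identity reads 2AB

def f(A):
    return np.trace(A @ B @ A.T)

# finite-difference gradient of trace(A B A^T) with respect to A, entry by entry
eps = 1e-6
grad_fd = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = eps
        grad_fd[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.allclose(grad_fd, 2 * A @ B, atol=1e-5))  # True
```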
But from there, there is a bit of a skip to the result of the differentiation
$${\frac{\partial L}{\partial \boldsymbol{ K_{k+1}}}} = -2(\boldsymbol{I-K_{k+1}H_{k+1})P_{k+1|k}H_{k+1}^T+2K_{k+1}R_{k+1} } = 0$$
What I am wondering about is the origin of the $H_{k+1}^T$ term in the above equation.
It appears to me that the chain rule is applied here, but most online sources on the matrix chain rule or trace identities do not show a direct example of the chain rule being applied to the derivative of a trace.
I.e., this is what I think happens in the intermediate step:
$$\boldsymbol{A=I-K_{k+1} H_{k+1}; B=P_{k+1|k}}$$
$${\frac{\partial}{\partial \boldsymbol{K_{k+1}}}} (trace(\boldsymbol{ABA^T})) = 2\boldsymbol{AB} * {\frac{\partial (trace (\boldsymbol{A}))}{\partial \boldsymbol{ K_{k+1}}}}$$
Question: Is the above chain rule application valid? If so, I would appreciate a pointer to a proof.
And then using
$$ {\frac{\partial trace(\boldsymbol{K_{k+1}H_{k+1}} )} {\boldsymbol{\partial K_{k+1}}}} = \boldsymbol{H_{k+1}^T} $$
To arrive at
$$ = -2(\boldsymbol{I-K_{k+1} H_{k+1}) * P_{k+1|k} * H_{k+1}^T}$$
If the intermediate steps are actually different, please let me know and link to any proofs of the identities used, as I am self-learning linear algebra. Thanks.
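(For what it is worth, the simpler identity used above, $\frac{\partial\, trace(\boldsymbol{K_{k+1}H_{k+1}})}{\partial \boldsymbol{K_{k+1}}}=\boldsymbol{H_{k+1}^T}$, does check out numerically. This finite-difference sketch is my own, not from the notes:)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
K = rng.standard_normal((n, n))
H = rng.standard_normal((n, n))

# finite-difference gradient of trace(K H) with respect to K, entry by entry
eps = 1e-6
grad_fd = np.zeros_like(K)
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = eps
        grad_fd[i, j] = (np.trace((K + E) @ H) - np.trace((K - E) @ H)) / (2 * eps)

print(np.allclose(grad_fd, H.T))  # True
```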
Firstly, $\nabla_A\,trace(ABA^T)=A(B+B^T)=2AB$ only if $B$ is symmetric.
Here $L=trace((I-KH)Q(I-KH)^T+KRK^T)$, where $Q,R$ are symmetric matrices. The derivative is
$$DL_K: Z\rightarrow trace\big(-ZHQ(I-KH)^T+(I-KH)Q(-ZH)^T+ZRK^T+KRZ^T\big)$$
$$=trace\big(-2ZHQ(I-KH)^T+2ZRK^T\big)=trace\big((-2HQ(I-KH)^T+2RK^T)Z\big).$$
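(A numerical check of that directional derivative, with $Q$ playing the role of $P_{k+1|k}$; this finite-difference sketch is mine, not part of the answer:)

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
K = rng.standard_normal((n, n))
Z = rng.standard_normal((n, n))  # arbitrary direction
H = rng.standard_normal((n, n))
Q = rng.standard_normal((n, n)); Q = Q + Q.T  # symmetric (stands in for P_{k+1|k})
R = rng.standard_normal((n, n)); R = R + R.T  # symmetric (noise covariance)

I = np.eye(n)

def L(K):
    return np.trace((I - K @ H) @ Q @ (I - K @ H).T + K @ R @ K.T)

# finite-difference directional derivative (L(K + tZ) - L(K - tZ)) / (2t)
t = 1e-6
dd_fd = (L(K + t * Z) - L(K - t * Z)) / (2 * t)

# closed form from the answer: DL_K(Z) = trace((-2 H Q (I-KH)^T + 2 R K^T) Z)
M = -2 * H @ Q @ (I - K @ H).T + 2 * R @ K.T
dd = np.trace(M @ Z)
print(np.allclose(dd_fd, dd, atol=1e-4))  # True
```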
I have written what follows in at least 10 posts, but nobody reads my posts (tears). Yet no student knows how to calculate a gradient. Incredible!
A gradient is associated with a scalar product; here we take $(U,V)=trace(U^TV)$. The associated gradient is defined by requiring, for every $Z$, $(\nabla_K(L),Z)=DL_K(Z)$. Thus $\nabla_K(L)=-2(I-KH)QH^T+2KR$.
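(The gradient formula can likewise be checked entry by entry against finite differences; again a sketch of my own, not from the answer:)

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
K = rng.standard_normal((n, n))
H = rng.standard_normal((n, n))
Q = rng.standard_normal((n, n)); Q = Q + Q.T  # symmetric (stands in for P_{k+1|k})
R = rng.standard_normal((n, n)); R = R + R.T  # symmetric (noise covariance)

I = np.eye(n)

def L(K):
    return np.trace((I - K @ H) @ Q @ (I - K @ H).T + K @ R @ K.T)

# finite-difference gradient of L with respect to K, entry by entry
eps = 1e-6
grad_fd = np.zeros_like(K)
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = eps
        grad_fd[i, j] = (L(K + E) - L(K - E)) / (2 * eps)

# closed form: grad_K L = -2 (I - K H) Q H^T + 2 K R
grad = -2 * (I - K @ H) @ Q @ H.T + 2 * K @ R
print(np.allclose(grad_fd, grad, atol=1e-4))  # True
```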
EDIT (answers to frank): 1), 2) $DL_K$ is a linear map and $Z$ is the variable. For instance, if $f:K\rightarrow KAK$, then $Df_K:Z\rightarrow ZAK+KAZ$: the derivative of a product! ($(uv)'=u'v+uv'$.) If $f:K\rightarrow -KHQ$, then $Df_K:Z\rightarrow -ZHQ$, since $(ua)'=u'a$.
3) Here $Df_X:(h,k,l)\rightarrow [3z,4y^3,3x][h,k,l]^T$. We use the scalar product on vectors $(u,v)=trace(u^Tv)=u^Tv$. Then $\nabla_Xf=[3z,4y^3,3x]^T$. More generally, defining a gradient via a scalar product has a geometric meaning: you do not need a basis.
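(The row $[3z,4y^3,3x]$ is consistent with, up to a constant, $f(x,y,z)=3xz+y^4$; the original $f$ is not restated in this post, so that reconstruction is my assumption. A quick numerical check of the gradient, my own sketch:)

```python
import numpy as np

# Hypothetical f consistent with the stated Jacobian row [3z, 4y^3, 3x]
def f(X):
    x, y, z = X
    return 3 * x * z + y**4

X = np.array([0.5, -1.2, 2.0])
x, y, z = X
grad = np.array([3 * z, 4 * y**3, 3 * x])  # gradient = transpose of the Jacobian row

# finite-difference check along each basis vector
eps = 1e-6
grad_fd = np.array([(f(X + eps * e) - f(X - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(grad_fd, grad, atol=1e-6))  # True
```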
4) The derivative uses the trace and the gradient does not; the gradient is linked to the chosen scalar product.
5) Consider the map $f:K\in\mathcal{M}_n(\mathbb{R})\rightarrow KAK\in\mathcal{M}_n(\mathbb{R})$; then the derivative is the linear map $Df_K:Z\in\mathcal{M}_n(\mathbb{R})\rightarrow ZAK+KAZ\in\mathcal{M}_n(\mathbb{R})$. In the same way, if $g(K)=K^3$, then $Dg_K(Z)=ZK^2+KZK+K^2Z$. If you choose coordinates, then the matrix of the derivative is its Jacobian. For 3), $f:\mathbb{R}^3\rightarrow \mathbb{R}$; let $X=[x,y,z]^T$. $Df_X$ is a linear map $(h,k,l)\in\mathbb{R}^3\rightarrow \mathbb{R}$ and, consequently, its matrix (the Jacobian) is $1\times 3$, that is, a row. The gradient is the transpose of that matrix, that is, a vector.
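(The product-rule formula $Dg_K(Z)=ZK^2+KZK+K^2Z$ for $g(K)=K^3$ can also be confirmed numerically; again my own finite-difference sketch:)

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
K = rng.standard_normal((n, n))
Z = rng.standard_normal((n, n))  # direction in which we differentiate

def g(K):
    return np.linalg.matrix_power(K, 3)

# finite-difference directional derivative of g at K in direction Z
t = 1e-6
Dg_fd = (g(K + t * Z) - g(K - t * Z)) / (2 * t)

# product-rule formula: Dg_K(Z) = Z K^2 + K Z K + K^2 Z
Dg = Z @ K @ K + K @ Z @ K + K @ K @ Z
print(np.allclose(Dg_fd, Dg, atol=1e-5))  # True
```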