Chain rule on matrix differentiation of trace in Kalman gain proof

890 Views Asked by At

I was reading the following derivation of the Kalman filter gain http://www.robots.ox.ac.uk/~ian/Teaching/Estimation/LectureNotes2.pdf on pg 5-6.

From the following

$$L=\min_{K_{k+1}} trace(P_{k+1|k+1})$$

With

$$ P_{k+1|k+1} = \boldsymbol{ (I-K_{k+1} H_{k+1}) P_{k+1|k} (I-K_{k+1} H_{k+1})^{T} + K_{k+1}R_{k+1}K_{k+1}^T }$$

K,H, and P are all nxn matrices.

It uses the following trace differentiation identity:

$${\frac{\partial}{\partial \boldsymbol{A}}} (trace(\boldsymbol{ABA^T})) = 2\boldsymbol{AB}$$

But from there, there is a bit of a skip to the result of the differentiation

$${\frac{\partial L}{\partial \boldsymbol{ K_{k+1}}}} = -2(\boldsymbol{I-K_{k+1}H_{k+1})P_{k+1|k}H_{k+1}^T+2K_{k+1}R_{k+1} } = 0$$

What I am wondering about is the origin of the $H_{k+1}^T$ term in the above equation.

It appears to me that chain rule is applied here. But most online sources on matrix chain rule or trace identities does not show a direct example that the chain rule can be applied to the derivative of a trace.

ie. This is what I think happens in the intermediate step:

$$\boldsymbol{A=I-K_{k+1} H_{k+1}; B=P_{k+1|k}}$$

$${\frac{\partial}{\partial \boldsymbol{K_{k+1}}}} (trace(\boldsymbol{ABA^T})) = 2\boldsymbol{AB} * {\frac{\partial (trace (\boldsymbol{A}))}{\partial \boldsymbol{ K_{k+1}}}}$$

Question: Is the above chain rule application valid? If so, I would appreciate a pointer to a proof.

And then using

$$ {\frac{\partial trace(\boldsymbol{K_{k+1}H_{k+1}} )} {\boldsymbol{\partial K_{k+1}}}} = \boldsymbol{H_{k+1}^T} $$

To arrive at

$$ = -2(\boldsymbol{I-K_{k+1} H_{k+1}) * P_{k+1|k} * H_{k+1}^T}$$

If the intermediate steps is actually different, please let me know and link to any proofs of the identities used as I am self-learning linear algebra. Thanks.

2

There are 2 best solutions below

5
On BEST ANSWER

Firstly $\nabla_A(ABA^T)=A(B+B^T)=2AB$ only if $B$ is symmetric.

Here $L=trace((I-KH)Q(I-KH)^T+KRK^T)$ where $Q,R$ are symmetric matrices. The derivative is $DL_K:Z\rightarrow trace(-ZHQ(I-KH)^T+(I-KH)Q(-ZH)^T+ZRK^T+KRZ^T)=trace(-2ZHQ(I-KH)^T+2ZRK^T)=trace((-2HQ(I-KH)^T+2RK^T)Z)$.

I wrote that follows in at least 10 posts but nobody reads my post (tears). Yet, no students know how to calculate a gradient. Incredible !

A gradient is associated to a scalar product; we take, here, $(U,V)=trace(U^TV)$. The associated gradient is defined by, for every $Z$, $(\nabla_K(L),Z)=DL_K(Z)$. Thus $\nabla_K(L)=-2(I-KH)QH^T+2KR$.

EDIT (answers to frank): 1),2) $DL_K$ is a linear application and $Z$ is the variable. For instance, if $f:K\rightarrow KAK$, then $Df_K:Z\rightarrow ZAK+KAZ$, the derivative of a product ! ($(uv)'=u'v+uv'$). If $f:K\rightarrow -KHQ$, then $Df_K:Z\rightarrow -ZHQ$. $(ua)'=u'a$.

3) Here $Df_X:(h,k,l)\rightarrow [3z,4y^3,3x][h,k,l]^T$. We use the scalar product over vectors $(u,v)=trace(u^Tv)=u^Tv$. Then $\nabla_Xf=[3z,4y^3,3x]^T$. More generally, to define a gradient with the help of a scalar product has a geometric meaning. You do not need a basis.

4) The derivative uses trace and the gradient not. It is linked to the chosen scalar product.

5) Consider the application $f:K\in\mathcal{M}_n(\mathbb{R})\rightarrow KAK\in\mathcal{M}_n(\mathbb{R})$, then the derivative is the linear application $Df_K:Z\in\mathcal{M}_n(\mathbb{R})\rightarrow ZAK+KAZ\in\mathcal{M}_n(\mathbb{R})$. In the same way, if $g(K)=K^3$, then $Dg_K(Z)=ZK^2+KZK+K^2Z$. If you choose coordinates, then the matrix of the derivative is its Jacobian. For 3), $f:\mathbb{R^3}\rightarrow \mathbb{R}$ and let $X=[x,y,z]^T$. $Df_X$ is a linear application $(h,k,l)\in\mathbb{R^3}\rightarrow \mathbb{R}$ and , consequently, its matrix (the Jacobian) is $1\times 3$, that is a row. The gradient is the transpose of the previous matrix, that is a vector.

2
On

$ \def\bb{\mathbb} \def\tss{\mathop{\bigotimes}\limits} \def\dss{\mathop{\odot}\limits} $ For ease of typing, drop the elaborate subscripts and define the matrices $$\eqalign{ &H = H_{k+1} \qquad &P = P_{k+1|k} \qquad &R = R_{k+1} \\ &X = K_{k+1} \qquad &Y = XH-I \qquad &dY = dX\,H \\ }$$ Then the problem is to calculate the gradient of the function $$\eqalign{ L &= {\rm Tr}\Big(XRX^T + YPY^T\Big) \\ &= R:X^TX + P:Y^TY \\ }$$ where a colon denotes the trace/Frobenius product, i.e. $$\eqalign{ A:B = {\rm Tr}(A^TB) = {\rm Tr}(B^TA) = B:A \\ }$$ Calculate the differential, then the gradient. $$\eqalign{ dL &= R:(dX^TX+X^TdX) + P:(dY^TY+Y^TdY) \\ &= (R+R^T):X^TdX + (P+P^T):Y^TdY \\ &= X(R+R^T):dX + Y(P+P^T):dY \\ &= X(R+R^T):dX + (XH-I)(P+P^T):dX\,H \\ &= \Big(X(R+R^T) + (XH-I)(P+P^T)H^T\Big):dX \\ \frac{\partial L}{\partial X} &= X(R+R^T) + (XH-I)(P+P^T)H^T \\ }$$ If $R$ and $P$ are symmetric, then this can be simplified to the given result $$\eqalign{ \frac{\partial L}{\partial X} &= 2XR + 2(XH-I)PH^T \\ }$$