Verification of Bishop C.22 theorem (theorem proof using eigenvectors)


I've been independently working through Bishop's machine learning textbook (Pattern Recognition and Machine Learning, 2006), and at the end of Appendix C (Properties of Matrices) there's a proposition (C.22) stated as follows on page 701, where $x$ is a scalar, $\boldsymbol{A}$ is a square non-degenerate (invertible) matrix, $\ln$ is the natural logarithm, and $\mbox{Tr}$ is the usual trace function on matrices ($\mbox{Tr}(\boldsymbol{A})=\sum_i A_{ii}$):

$\frac{\delta}{\delta x} \ln |\boldsymbol{A}| = \mbox{Tr} \left( \boldsymbol{A}^{-1} \frac{\delta \boldsymbol{A}}{\delta x}\right)$

It is stated that C.22 can be verified with some previously derived theorems, and then promptly left as an exercise to the reader. After spending a few hours on it, checking the errata, trawling online for a proof (although my google-fu could be too weak: it turns out typing "trace log derivative determinant inverse proof" is not the best idea), and then enlisting a friend for an evening, I'm still missing a final (?) step.
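At least I've convinced myself numerically that the identity holds. Here is a quick finite-difference sanity check (the matrix family $A(x)$ below is my own arbitrary symmetric positive-definite choice, not from the book):

```python
import numpy as np

# Hypothetical symmetric, positive-definite family A(x) = B + x*C,
# chosen only so that ln|A(x)| is well defined; not from Bishop's text.
def A(x):
    B = np.array([[2.0, 0.3, 0.1],
                  [0.3, 3.0, 0.2],
                  [0.1, 0.2, 4.0]])
    C = np.array([[0.5, 0.1, 0.0],
                  [0.1, 0.7, 0.3],
                  [0.0, 0.3, 0.9]])
    return B + x * C

x0, h = 0.4, 1e-6

# Left side: d/dx ln|A| via central finite differences.
lhs = (np.log(np.linalg.det(A(x0 + h))) -
       np.log(np.linalg.det(A(x0 - h)))) / (2 * h)

# Right side: Tr(A^{-1} dA/dx), with dA/dx also by central differences.
dA = (A(x0 + h) - A(x0 - h)) / (2 * h)
rhs = np.trace(np.linalg.inv(A(x0)) @ dA)

assert abs(lhs - rhs) < 1e-6   # the two sides agree up to finite-difference error
```

So the identity is certainly true; what I'm after is the symbolic verification.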

First, the previously derived results (referred to in the text as C.33, C.45, C.46, C.47), where $\lambda_i$ is the $i$-th eigenvalue of the matrix $\boldsymbol{A}$, $\boldsymbol{u}_i$ is the associated eigenvector, a superscript $T$ denotes transposition, and $|\boldsymbol{A}|$ is the determinant of the matrix $\boldsymbol{A}$:

(C.33) $\boldsymbol{u}_i^T \boldsymbol{u}_j = I_{ij}$ (where $I_{ij}$ is the $(i,j)$ entry of the identity matrix, i.e. the Kronecker delta)

(C.45) $\boldsymbol{A} = \sum \lambda_i \boldsymbol{u}_i \boldsymbol{u}_i^T$

(C.46) $\boldsymbol{A}^{-1} = \sum \frac{1}{\lambda_i} \boldsymbol{u}_i \boldsymbol{u}_i^T$

(C.47) $|\boldsymbol{A}| = \prod \lambda_i$
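(All four facts are easy to sanity-check numerically on a random symmetric matrix; the sketch below is my own, relying on `np.linalg.eigh` returning orthonormal eigenvectors as columns of $U$.)

```python
import numpy as np

# Sanity check of (C.33), (C.45)-(C.47) on a random symmetric matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T + 8 * np.eye(4)        # symmetric, shifted away from singularity

lam, U = np.linalg.eigh(A)         # lam[i] = lambda_i, U[:, i] = u_i

# (C.33): u_i^T u_j = I_ij
assert np.allclose(U.T @ U, np.eye(4))

# (C.45): A = sum_i lambda_i u_i u_i^T
assert np.allclose(A, sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(4)))

# (C.46): A^{-1} = sum_i (1/lambda_i) u_i u_i^T
assert np.allclose(np.linalg.inv(A),
                   sum(np.outer(U[:, i], U[:, i]) / lam[i] for i in range(4)))

# (C.47): |A| = prod_i lambda_i
assert np.allclose(np.linalg.det(A), np.prod(lam))
```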

And then my derivation thus far, starting with the left side:

$\frac{\delta}{\delta x} \ln\left(|\boldsymbol{A}|\right) = \frac{\delta}{\delta x} \ln \left(\prod \lambda_i \right)$ (via C.47)

$= \frac{\delta}{\delta x} \sum \ln \lambda_i$

$= \sum \frac{\delta}{\delta x} \ln \lambda_i$

$= \sum \frac{1}{\lambda_i} \frac{\delta \lambda_i}{\delta x}$ (applying the chain rule)

Switching to the right side:

$\mbox{Tr}\left( \boldsymbol{A}^{-1} \frac{\delta}{\delta x} \boldsymbol{A} \right)$ $= \mbox{Tr}\left( \left(\sum_i \frac{1}{\lambda_i} \boldsymbol{u}_i \boldsymbol{u}_i^T \right) \frac{\delta}{\delta x} \left(\sum_j \lambda_j \boldsymbol{u}_j \boldsymbol{u}_j^T \right) \right)$ (via C.45, C.46)

$= \mbox{Tr}\left(\sum_i \sum_j \frac{1}{\lambda_i} \boldsymbol{u}_i \boldsymbol{u}_i^T \frac{\delta}{\delta x} \left( \lambda_j \boldsymbol{u}_j \boldsymbol{u}_j^T \right) \right)$

$= \mbox{Tr}\left(\sum_i \sum_j \frac{1}{\lambda_i} \boldsymbol{u}_i \boldsymbol{u}_i^T \left( \frac{\delta \lambda_j}{\delta x} \boldsymbol{u}_j \boldsymbol{u}_j^T + \lambda_j \frac{\delta \boldsymbol{u}_j \boldsymbol{u}_j^T}{\delta x} \right) \right)$ (via the product rule)

$= \mbox{Tr}\left( \left( \sum_i \sum_j \frac{1}{\lambda_i} \frac{\delta \lambda_j}{\delta x} \boldsymbol{u}_i \boldsymbol{u}_i^T \boldsymbol{u}_j \boldsymbol{u}_j^T \right) + \left( \sum_i \sum_j \frac{\lambda_j}{\lambda_i} \boldsymbol{u}_i \boldsymbol{u}_i^T \frac{\delta \boldsymbol{u}_j \boldsymbol{u}_j^T}{\delta x} \right) \right)$

$= \mbox{Tr}\left( \left( \sum_i \frac{1}{\lambda_i} \frac{\delta \lambda_i}{\delta x} \boldsymbol{u}_i \boldsymbol{u}_i^T \right) + \left( \sum_i \sum_j \frac{\lambda_j}{\lambda_i} \boldsymbol{u}_i \boldsymbol{u}_i^T \frac{\delta \boldsymbol{u}_j \boldsymbol{u}_j^T}{\delta x} \right) \right)$ (via C.33)

Now, I can see the end game: if only I could throw away that double sum on the right side ($\sum_i \sum_j \frac{\lambda_j}{\lambda_i} \boldsymbol{u}_i \boldsymbol{u}_i^T \frac{\delta \boldsymbol{u}_j \boldsymbol{u}_j^T}{\delta x}$), then C.33 would give $\mbox{Tr}(\boldsymbol{u}_i \boldsymbol{u}_i^T) = \boldsymbol{u}_i^T \boldsymbol{u}_i = 1$ for the remaining single sum and I'd be done with the verification. However, I can't see a way to eliminate it, and attempts to prove it equal to zero haven't borne fruit.

Help/pointers/frustratingly vague haikus crafted for mathematical enlightenment would be appreciated. Help classifying/titling the question would also be nice.

There are 3 answers below.

Best answer:

You've done the left hand side. I'll do the right hand side.

Let $A_x: (a,b) \rightarrow \mathbb{R}^{n^2}$ be such that $A_x$ is differentiable and self-adjoint when viewed as a matrix.

By the spectral theorem, we may find functions $Q_x, \Lambda_x$ where
$Q_x$ and $\Lambda_x$ are differentiable,
$Q_x$ is orthogonal,
$\Lambda_x$ is diagonal, and
$A_x = Q_x \Lambda_x Q_x^T$.

Luckily the product rule for derivatives still holds under matrix multiplication: $A_x^{ \prime} = Q_x^{\prime} \Lambda_x Q_x^T + Q_x \Lambda_x^{\prime} Q_x^T + Q_x \Lambda_x Q_x^{T\prime}$

It is not hard to show that $A_x^{-1}= Q_x \Lambda_x^{-1} Q_x^T$ where the inversion is matrix inversion as opposed to function inversion.

Putting it together, we get $A_x^{-1} A_x^{ \prime} = Q_x \Lambda_x^{-1} Q_x^T( Q_x^{\prime} \Lambda_x Q_x^T + Q_x \Lambda_x^{\prime} Q_x^T + Q_x \Lambda_x Q_x^{T\prime})$ which is $Q_x \Lambda_x^{-1} Q_x^T Q_x^{\prime} \Lambda_x Q_x^T + Q_x \Lambda_x^{-1} \Lambda_x^{\prime} Q_x^T + Q_x Q_x^{T\prime}$

At this point it seems we are stuck, but applying the trace operator, and remembering that trace is invariant under cyclic permutations and similarity, we get $$\text{Tr}(Q_x \Lambda_x^{-1} Q_x^T Q_x^{\prime} \Lambda_x Q_x^T) + \text{Tr}(Q_x \Lambda_x^{-1} \Lambda_x^{\prime} Q_x^T) + \text{Tr}(Q_x Q_x^{T\prime}) = $$

$$ \text{Tr}( Q_x^T Q_x^{\prime} ) + \text{Tr}(\Lambda_x^{-1} \Lambda_x^{\prime}) + \text{Tr}(Q_x^{T\prime} Q_x ) = $$

$$ \text{Tr}( Q_x^T Q_x^{\prime} + Q_x^{T\prime}Q_x) + \text{Tr}(\Lambda_x^{-1} \Lambda_x^{\prime}) =$$ $$\text{Tr}( (Q_x^T Q_x)^{\prime}) + \text{Tr}(\Lambda_x^{-1} \Lambda_x^{\prime}) =$$

$$\text{Tr}( I^{\prime}) + \text{Tr}(\Lambda_x^{-1} \Lambda_x^{\prime}) =$$ $$\text{Tr}( 0) + \text{Tr}(\Lambda_x^{-1} \Lambda_x^{\prime}) =$$ $$\text{Tr}(\Lambda_x^{-1} \Lambda_x^{\prime})$$

Since $\Lambda_x$ is diagonal with entries $\lambda_i$, this last trace is $\sum_i \frac{1}{\lambda_i}\frac{d\lambda_i}{dx}$, which is exactly the expression you derived for the left-hand side.
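The final trace $\text{Tr}(\Lambda_x^{-1} \Lambda_x^{\prime})$ involves only eigenvalues, so it can be checked numerically without differentiating eigenvectors (sorted eigenvalues from `eigvalsh` vary smoothly when they stay well separated). The matrix family below is a hypothetical choice of mine, not from the question:

```python
import numpy as np

# Check that sum_i (dlambda_i/dx)/lambda_i matches d/dx ln|A(x)|.
# Eigenvalues are kept well separated so the sorted eigenvalues of
# A(x) are smooth in x near x0 and finite differences are valid.
def A(x):
    base = np.diag([1.0, 3.0, 6.0])
    pert = np.array([[0.2, 0.1, 0.0],
                     [0.1, 0.3, 0.1],
                     [0.0, 0.1, 0.4]])
    return base + x * pert

x0, h = 0.5, 1e-6
lam_p = np.linalg.eigvalsh(A(x0 + h))
lam_m = np.linalg.eigvalsh(A(x0 - h))
lam = np.linalg.eigvalsh(A(x0))

lhs = np.sum((lam_p - lam_m) / (2 * h) / lam)   # sum_i lambda_i'/lambda_i
rhs = (np.log(np.linalg.det(A(x0 + h))) -
       np.log(np.linalg.det(A(x0 - h)))) / (2 * h)
assert abs(lhs - rhs) < 1e-5
```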


This answer uses the notational conventions and established results of Bishop's book and continues from where the asker got stuck, so that we can see how "to verify (C.22) by making use of the results (C.33), (C.45), (C.46), and (C.47)." I think this is the proof Bishop intended. The book is widely adopted in the machine learning community, so I will assume the reader has a copy of it and will use its contents without excerpting them.

The asker has arrived at $$\operatorname{Tr}\left(\mathbf A^{-1}\frac{\partial\mathbf A}{\partial x}\right)=\operatorname{Tr}\left[\sum\limits_{i,j}\left(\frac{1}{\lambda_i}\mathbf u_i\mathbf u_i^T\right)\frac{\partial}{\partial x}(\lambda_j\mathbf u_j\mathbf u_j^T)\right],$$ but did not expand $\frac{\partial}{\partial x}(\lambda_j\mathbf u_j\mathbf u_j^T)$ completely. In Bishop's book, matrix-to-scalar derivatives follow the so-called denominator layout: each partial derivative of a matrix element resides at the same position in the resulting matrix, so the matrix shape is unchanged. This notation facilitates subsequent matrix multiplications. Equation (C.20) gives a product rule in this notation for two matrices and a vector variable. Here, we reduce the variable to the scalar $x$ and extend the product to an arbitrary number $n$ of matrices. By induction based on (C.20), we have $$\frac{\partial}{\partial x}(\mathbf A_1\mathbf A_2\cdots\mathbf A_n)=\frac{\partial\mathbf A_1}{\partial x}\mathbf A_2\cdots\mathbf A_n+\mathbf A_1\frac{\partial\mathbf A_2}{\partial x}\mathbf A_3\cdots\mathbf A_n+\cdots+\mathbf A_1\mathbf A_2\cdots\frac{\partial\mathbf A_n}{\partial x}$$ where $\frac{\partial\mathbf A}{\partial x}$ is the matrix $\left(\frac{\partial A_{ij}}{\partial x}\right)$ as explained above. This equation holds even if some of the factors are scalars or vectors.

Applying the above general product rule in matrix form, we get $$\frac{\partial}{\partial x}(\lambda_j\mathbf u_j\mathbf u_j^T)=\frac{\partial\lambda_j}{\partial x}\mathbf u_j\mathbf u_j^T+\lambda_j\frac{\partial\mathbf u_j}{\partial x}\mathbf u_j^T+\lambda_j\mathbf u_j\frac{\partial\mathbf u_j^T}{\partial x}.$$ As an aside, this general product rule could also be used to derive the left side of (C.22), with $|\mathbf A|$ as an intermediate variable.
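If desired, the $n$-factor product rule itself can be sanity-checked numerically. The sketch below uses a hypothetical family of linear matrix functions $A_i(x) = B_i + x C_i$ (my own choice, so each derivative is simply $C_i$):

```python
import numpy as np

# Finite-difference check of the three-factor matrix product rule:
# d/dx (A1 A2 A3) = A1' A2 A3 + A1 A2' A3 + A1 A2 A3'.
rng = np.random.default_rng(2)
B = [rng.standard_normal((3, 3)) for _ in range(3)]
C = [rng.standard_normal((3, 3)) for _ in range(3)]
F = [lambda x, i=i: B[i] + x * C[i] for i in range(3)]  # A_i(x); A_i'(x) = C[i]

x0, h = 0.3, 1e-6
P = lambda x: F[0](x) @ F[1](x) @ F[2](x)
num = (P(x0 + h) - P(x0 - h)) / (2 * h)     # numerical derivative of the product

rule = (C[0] @ F[1](x0) @ F[2](x0) +
        F[0](x0) @ C[1] @ F[2](x0) +
        F[0](x0) @ F[1](x0) @ C[2])         # the product-rule expansion
assert np.allclose(num, rule, atol=1e-5)
```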

So, the trace is equal to $$\operatorname{Tr}\left[\sum\limits_{i,j}\left(\frac{1}{\lambda_i}\mathbf u_i\mathbf u_i^T\right)\left(\frac{\partial\lambda_j}{\partial x}\mathbf u_j\mathbf u_j^T+\lambda_j\frac{\partial\mathbf u_j}{\partial x}\mathbf u_j^T+\lambda_j\mathbf u_j\frac{\partial\mathbf u_j^T}{\partial x}\right)\right].$$

Because the eigenvectors $\{\mathbf u_i\}$ are orthogonal, only the products with $i=j$ survive, so the double sum $\sum\limits_{i,j}$ reduces to the single sum $\sum\limits_i$. For the second term of the right factor we use the cyclic property of the trace, $\operatorname{Tr}(\mathbf u_i\mathbf u_i^T\frac{\partial\mathbf u_j}{\partial x}\mathbf u_j^T)=\operatorname{Tr}(\mathbf u_i^T\frac{\partial\mathbf u_j}{\partial x}\mathbf u_j^T\mathbf u_i)$, which exposes the factor $\mathbf u_j^T\mathbf u_i$ that eliminates the $i\ne j$ products.

Now the trace becomes $$\operatorname{Tr}\left[\sum\limits_i\left(\frac{1}{\lambda_i}\mathbf u_i\mathbf u_i^T\right)\left(\frac{\partial\lambda_i}{\partial x}\mathbf u_i\mathbf u_i^T+\lambda_i\frac{\partial\mathbf u_i}{\partial x}\mathbf u_i^T+\lambda_i\mathbf u_i\frac{\partial\mathbf u_i^T}{\partial x}\right)\right]$$ $$=\operatorname{Tr}\left[\sum\limits_i\left(\frac{1}{\lambda_i}\frac{\partial\lambda_i}{\partial x}\mathbf u_i\mathbf u_i^T\mathbf u_i\mathbf u_i^T+\mathbf u_i\mathbf u_i^T\frac{\partial\mathbf u_i}{\partial x}\mathbf u_i^T+\mathbf u_i\mathbf u_i^T\mathbf u_i\frac{\partial\mathbf u_i^T}{\partial x}\right)\right]$$ $$=\sum\limits_i\frac{1}{\lambda_i}\frac{\partial\lambda_i}{\partial x}\operatorname{Tr}(\mathbf u_i\mathbf u_i^T)+\sum\limits_i\operatorname{Tr}\left(\mathbf u_i^T\frac{\partial\mathbf u_i}{\partial x}\mathbf u_i^T\mathbf u_i+\mathbf u_i\frac{\partial\mathbf u_i^T}{\partial x}\right).$$ In the last equality, the first term results from the fact that the eigenvectors $\{\mathbf u_i\}$ are orthonormal, so $\mathbf u_i^T\mathbf u_i=1$ (Eq (C.33)). The second term uses the cyclic property of the trace again, together with its linearity, which lets the trace move inside the sum. Applying orthonormality again, the trace reduces to $$\sum\limits_i\frac{1}{\lambda_i}\frac{\partial\lambda_i}{\partial x}\operatorname{Tr}(\mathbf u_i\mathbf u_i^T)+\sum\limits_i\operatorname{Tr}\left(\mathbf u_i^T\frac{\partial\mathbf u_i}{\partial x}+\mathbf u_i\frac{\partial\mathbf u_i^T}{\partial x}\right).$$

$\operatorname{Tr}(\mathbf u_i\mathbf u_i^T)=1$, again by orthonormality of the eigenvectors. For the second sum of the above result, $$\operatorname{Tr}\left(\mathbf u_i\frac{\partial\mathbf u_i^T}{\partial x}\right)=\operatorname{Tr}\left(\frac{\partial\mathbf u_i^T}{\partial x}\mathbf u_i\right)$$ by the cyclic property of the trace once more. The desired trace is now $$\sum\limits_i\frac{1}{\lambda_i}\frac{\partial\lambda_i}{\partial x}+\sum\limits_i\operatorname{Tr}\left(\mathbf u_i^T\frac{\partial\mathbf u_i}{\partial x}+\frac{\partial\mathbf u_i^T}{\partial x}\mathbf u_i\right).$$

The first sum is exactly what the asker derived for the left side. Next we eliminate the second one. If we apply the product rule for matrices (C.20) in reverse, we see $$\eqalign{ \mathbf u_i^T\frac{\partial\mathbf u_i}{\partial x}+\frac{\partial\mathbf u_i^T}{\partial x}\mathbf u_i&=\frac{\partial}{\partial x}(\mathbf u_i^T\mathbf u_i)\\ &=\frac{\partial 1}{\partial x}\\ &=0.}$$

The trace of $0$ is also $0$, which concludes the proof in Bishop's style.
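The vanishing step can be illustrated with any concrete unit-norm curve; for instance $\mathbf u(x) = (\cos x, \sin x)$ (a hypothetical example of mine) satisfies $\mathbf u^T\mathbf u = 1$ for all $x$, so $\mathbf u^T\mathbf u' + \mathbf u'^T\mathbf u = 0$:

```python
import numpy as np

# u(x) = (cos x, sin x) stays on the unit circle, so differentiating
# u^T u = 1 forces u^T u' + u'^T u = 0.
x0, h = 0.8, 1e-6
u = lambda x: np.array([np.cos(x), np.sin(x)])
du = (u(x0 + h) - u(x0 - h)) / (2 * h)      # numerical u'(x0)

val = u(x0) @ du + du @ u(x0)               # u^T u' + u'^T u
assert abs(val) < 1e-9
```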


There is no need to use eigenvectors. You can use the result demonstrated here (mathstack) and write the differential $$ d |\mathbf{A}| = |\mathbf{A}| \mathbf{A}^{-T}:d\mathbf{A} $$

It follows $$ d \ln|\mathbf{A}| = \frac{1}{|\mathbf{A}|} d|\mathbf{A}| = \mathbf{A}^{-T}:d\mathbf{A} = \left[\mathbf{A}^{-T}:\frac{\partial \mathbf{A}}{\partial x} \right] dx $$ since $d\mathbf{A} = \frac{\partial \mathbf{A}}{\partial x} dx$. We deduce $$ \frac{\partial \ln|\mathbf{A}|}{\partial x} = \mathbf{A}^{-T}:\frac{\partial \mathbf{A}}{\partial x} = \mathrm{tr} \left[\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial x} \right] $$

Here the colon operator denotes the Frobenius inner product, which satisfies $\mathbf{A}:\mathbf{B}=\mathrm{tr}(\mathbf{A}^T\mathbf{B})$.
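The differential identity $d|\mathbf A| = |\mathbf A|\,\mathbf A^{-T}:d\mathbf A$ (Jacobi's formula) can also be checked numerically; the matrix and direction below are arbitrary hypothetical choices:

```python
import numpy as np

# Finite-difference check of d|A| = |A| A^{-T} : dA for a small,
# non-symmetric matrix; ":" is the Frobenius inner product.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # generically invertible
dA = rng.standard_normal((3, 3))                  # arbitrary direction
h = 1e-6

num = (np.linalg.det(A + h * dA) - np.linalg.det(A - h * dA)) / (2 * h)
formula = np.linalg.det(A) * np.sum(np.linalg.inv(A).T * dA)  # |A| A^{-T}:dA
assert abs(num - formula) < 1e-5 * max(1.0, abs(formula))
```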