Example of "The eigenvalues of data covariance matrix, $\Phi^T\Phi$ measure the curvature of the likelihood function."


I am reading PRML, Chapter 3.5.3 (screenshot attached). I can follow the derivation and the maths, but I find it hard to understand the meaning of the statement "The eigenvalues of the data covariance matrix $\Phi^T\Phi$ measure the curvature of the likelihood function." Can you please help me understand by providing an example that gives the intuitive meaning of this statement?

Accepted answer:

The marginal likelihood here is
$$
p(t|\alpha,\beta) = \int p(t|w,\beta)\, p(w|\alpha)\, \text{d}w = c\int \exp(-E(w))\, \text{d}w
$$
So the energy, or error, $E(w)$ basically determines the likelihood, where
\begin{align}
E(w) &= \frac{\beta}{2}\|t - \Phi w\|^2 + \frac{\alpha}{2} w^Tw \\
&= E(m_N) + \frac{1}{2}(w-m_N)^TA(w-m_N)
\end{align}
Here $\Phi$ is the design matrix and
$$
A = \alpha I + \beta \Phi^T\Phi = \mathcal{H}[E(w)] = \nabla\nabla E(w)
$$
is the Hessian of the error. Notice that the eigenvalues $\lambda_A$ of $A$ are real and positive, since $A$ is symmetric positive definite. Notice also that the eigenvalues of $A$ are related to those of $\Phi^T\Phi$ (call them $\lambda$) by a constant scale and shift:
$$
Av = \lambda_A v \;\implies\; \Phi^T\Phi v = \frac{1}{\beta}(\lambda_A - \alpha)v = \lambda v
$$
In other words, $\Phi^T\Phi$ controls the eigenvalues of the Hessian.
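To make the scale-and-shift relation concrete, here is a minimal numpy sketch. Everything in it is a made-up toy setup (a random $\Phi$ and arbitrary $\alpha$, $\beta$), not anything from the book:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4                        # 50 data points, 4 basis functions (toy sizes)
Phi = rng.normal(size=(N, M))       # stand-in design matrix
alpha, beta = 2.0, 25.0             # arbitrary prior/noise precisions

PtP = Phi.T @ Phi
A = alpha * np.eye(M) + beta * PtP  # Hessian of E(w)

lam = np.linalg.eigvalsh(PtP)       # eigenvalues of Phi^T Phi (ascending)
lam_A = np.linalg.eigvalsh(A)       # eigenvalues of A (ascending)

# A shares eigenvectors with Phi^T Phi; its eigenvalues are alpha + beta*lam
print(np.allclose(lam_A, alpha + beta * lam))   # True
```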

Informally speaking, we view the second derivative (here, the Hessian) as the curvature of a function. Since $A$ is the Hessian of the error $E(w)$, and the error determines the likelihood, we can reasonably say that $\Phi^T\Phi$ determines the curvature of the likelihood through $A$.

But this can be made more precise. Consider the eigendecomposition $\Phi^T\Phi = U\Lambda U^T$, where $U$ is orthogonal. Working in the orthonormal basis (a rotation of weight space) defined by $U=(u_1,\ldots,u_M)$, we can examine how $E(w)$ looks.
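As a quick numerical check of this rotated picture (the same kind of arbitrary toy setup as above), rotating the Hessian into the basis $U$ diagonalizes it, with diagonal entry $\alpha + \beta\lambda_i$ along axis $u_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
Phi = rng.normal(size=(N, M))              # toy design matrix
alpha, beta = 2.0, 25.0

A = alpha * np.eye(M) + beta * Phi.T @ Phi
lam, U = np.linalg.eigh(Phi.T @ Phi)       # Phi^T Phi = U diag(lam) U^T

# In the rotated coordinates w' = U^T w the quadratic form decouples:
# U^T A U is diagonal, with curvature alpha + beta*lam_i along axis u_i.
print(np.allclose(U.T @ A @ U, np.diag(alpha + beta * lam)))   # True
```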

Notice that a level curve of the error in weight space forms an ellipse, with the width of the ellipse along each axis $u_i$ inversely related to $\lambda_i$ (the axis length scales as $1/\sqrt{\alpha + \beta\lambda_i}$). A wide ellipse (small eigenvalue for that axis) means that a large range of the parameter space spanned by $u_i$ has very little effect on the error. In other words, the error surface is not very curved along that axis.

Essentially, smaller eigenvalues mean more contour elongation, which means less curvature (smaller Hessian eigenvalues).
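A small sketch of this flatness, again under an arbitrary toy setup (random $\Phi$, synthetic targets $t$): step a fixed distance $s$ from $m_N$ along the flattest and the most curved eigenvector directions and watch how much $E$ rises. For a quadratic the rise is exactly $\frac{1}{2}(\alpha+\beta\lambda_i)s^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
Phi = rng.normal(size=(N, M))                    # toy design matrix
alpha, beta = 2.0, 25.0
t = Phi @ rng.normal(size=M) + rng.normal(scale=beta**-0.5, size=N)  # toy targets

A = alpha * np.eye(M) + beta * Phi.T @ Phi       # Hessian of E(w)
lam, U = np.linalg.eigh(Phi.T @ Phi)
m_N = beta * np.linalg.solve(A, Phi.T @ t)       # minimizer of E (posterior mean)

def E(w):
    return 0.5 * beta * np.sum((t - Phi @ w) ** 2) + 0.5 * alpha * w @ w

s = 0.5                                          # fixed step length in weight space
for i in (0, M - 1):                             # flattest vs. most curved direction
    rise = E(m_N + s * U[:, i]) - E(m_N)
    # exact for a quadratic: rise = 0.5 * (alpha + beta*lam_i) * s**2
    print(f"lambda_i = {lam[i]:8.2f}   E rises by {rise:8.3f}")
```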

Bishop says:

a smaller curvature corresponds to a greater elongation of the contours of the likelihood function

The relation to the covariance of the (embedded) data somewhat makes sense. Recall that the posterior here is written:
\begin{align*}
p(w|t) &= \mathcal{N}(w|m_N,S_N) \\
m_N &= \beta S_N \Phi^T t \\
S_N^{-1} &= \alpha I + \beta \Phi^T\Phi = A
\end{align*}
So in this case, the covariance-like quantity $\Phi^T\Phi$ inversely controls the posterior covariance of the weights. As the eigenvalues $\lambda$ get larger, the covariance of the posterior over the weights gets smaller. This means that such parameters are tightly constrained to their mean (as straying too far will lead to a huge increase in error).
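A quick sketch (toy values once more) confirming this inverse control: the posterior variance along each eigenvector $u_i$ of $\Phi^T\Phi$ is exactly $1/(\alpha + \beta\lambda_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
Phi = rng.normal(size=(N, M))                     # toy design matrix
alpha, beta = 2.0, 25.0

lam, U = np.linalg.eigh(Phi.T @ Phi)
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)   # posterior covariance

# Variance of the posterior along each eigenvector u_i of Phi^T Phi
var_along_u = np.array([U[:, i] @ S_N @ U[:, i] for i in range(M)])
print(np.allclose(var_along_u, 1.0 / (alpha + beta * lam)))   # True
```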

Why?! Suppose all your data is clustered around one point. The eigenvalues of the data covariance will then be small, but the posterior will then be underconstrained: there are many parameter settings that could explain that little cluster. We want many points, spread all over the input space, in order to nail down a good function with low posterior covariance.
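To illustrate with a toy comparison (a simple polynomial basis and arbitrary constants, chosen only for this sketch): clustered inputs make $\Phi^T\Phi$ nearly singular, so the posterior stays wide in some direction, while spread inputs do the opposite:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, M = 2.0, 25.0, 4                    # arbitrary toy constants

x_clustered = 0.5 + 0.01 * rng.normal(size=30)   # all inputs near one point
x_spread = rng.uniform(-1.0, 1.0, size=30)       # inputs spread over the space

for name, x in [("clustered", x_clustered), ("spread", x_spread)]:
    Phi = np.vander(x, M, increasing=True)       # polynomial basis 1, x, x^2, x^3
    lam_min = np.linalg.eigvalsh(Phi.T @ Phi).min()
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    worst_std = np.sqrt(np.linalg.eigvalsh(S_N).max())   # widest posterior direction
    print(f"{name:9s}  min eig of Phi^T Phi = {lam_min:.2e}   "
          f"max posterior std = {worst_std:.3f}")
```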

So, greater (embedded) data covariance means larger eigenvalues, which means (1) smaller posterior covariance and (2) less elongation, which means larger error surface curvature (i.e., larger Hessian eigenvalues).