How to know when and where to transpose?


I was reading a great blog post about Fisher information, and I came upon the part where the Fisher matrix is defined as the variance of the score...

$$ \operatorname{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \;\implies\; \mathbb{E}\big[\nabla \log p(x\mid\theta)\, \nabla \log p(x\mid\theta)^\top\big] - 0 $$

I couldn't explain something basic to myself while reading this. Why is the generalization of multiplication done by the operation $AA^\top$ in the expectation? Why couldn't it be $A^\top A$ or even $AA$?

Best answer:

First, note that $\nabla\log p(x\mid\theta)$ is a vector, not a matrix. When calculating the variance-covariance matrix, only vectors are involved, and a vector cannot be multiplied by itself directly; one side must be transposed.

In the multivariate case, a random vector is defined as ${\bf X}=(X_1,\ldots,X_n)^T$, where $X_1,\ldots,X_n$ are random variables, though not necessarily independent. The product ${\bf X}{\bf X}$ is not even defined: an $n\times 1$ matrix cannot be multiplied by another $n\times 1$ matrix. If you take ${\bf X}^T{\bf X}$, you get only a scalar, $\sum_i X_i^2$; all the information about the individual pairs $(X_i,X_j)$ is lost.
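As a quick sanity check (a NumPy sketch, not part of the original argument), the shapes alone make the point: the inner product collapses everything to one number, while the outer product keeps one entry per pair of components.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # one realization of X = (X_1, X_2, X_3)^T

inner = x @ x                 # X^T X: a single scalar, sum of X_i^2
outer = np.outer(x, x)        # X X^T: a 3x3 matrix with (i, j) entry X_i * X_j

print(inner.shape)            # () -- a scalar: pairwise structure is gone
print(outer.shape)            # (3, 3) -- one entry per pair (X_i, X_j)
```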

Then why does ${\bf X}{\bf X}^T$ work? For $i,j=1,\ldots,n$, we have \begin{align*} \operatorname{Var}(X_i)&=\operatorname{\Bbb E}(X_i^2)-\operatorname{\Bbb E}(X_i)^2, \\ \operatorname{Cov}(X_i,X_j)&=\operatorname{\Bbb E}(X_iX_j)-\operatorname{\Bbb E}(X_i)\operatorname{\Bbb E}(X_j). \end{align*} These two identities hold generally, without any special assumption. Now let us take a look at ${\bf X}{\bf X}^T$: $${\bf X}{\bf X}^T=\begin{bmatrix} X_1^2 & X_1X_2 & \cdots & X_1X_n \\ X_2X_1 & X_2^2 & \cdots & X_2X_n \\ \vdots & \vdots & & \vdots \\ X_nX_1 & X_nX_2 & \cdots & X_n^2 \end{bmatrix}, $$ so $$\operatorname{\Bbb E}({\bf X}{\bf X}^T)=\begin{bmatrix} \operatorname{\Bbb E}(X_1^2) & \operatorname{\Bbb E}(X_1X_2) & \cdots & \operatorname{\Bbb E}(X_1X_n) \\ \operatorname{\Bbb E}(X_2X_1) & \operatorname{\Bbb E}(X_2^2) & \cdots & \operatorname{\Bbb E}(X_2X_n) \\ \vdots & \vdots & & \vdots \\ \operatorname{\Bbb E}(X_nX_1) & \operatorname{\Bbb E}(X_nX_2) & \cdots & \operatorname{\Bbb E}(X_n^2) \end{bmatrix}.$$ On the other hand, $$\operatorname{\Bbb E}({\bf X})=\begin{bmatrix} \operatorname{\Bbb E}(X_1) & \operatorname{\Bbb E}(X_2) & \cdots & \operatorname{\Bbb E}(X_n) \end{bmatrix}^T,$$ thus $$\operatorname{\Bbb E}({\bf X})\operatorname{\Bbb E}({\bf X})^T=\begin{bmatrix} \operatorname{\Bbb E}(X_1)^2 & \operatorname{\Bbb E}(X_1)\operatorname{\Bbb E}(X_2) & \cdots & \operatorname{\Bbb E}(X_1)\operatorname{\Bbb E}(X_n) \\ \operatorname{\Bbb E}(X_2)\operatorname{\Bbb E}(X_1) & \operatorname{\Bbb E}(X_2)^2 & \cdots & \operatorname{\Bbb E}(X_2)\operatorname{\Bbb E}(X_n) \\ \vdots & \vdots & & \vdots \\ \operatorname{\Bbb E}(X_n)\operatorname{\Bbb E}(X_1) & \operatorname{\Bbb E}(X_n)\operatorname{\Bbb E}(X_2) & \cdots & \operatorname{\Bbb E}(X_n)^2 \end{bmatrix}.$$ The difference between these two matrices is \begin{align*} &\operatorname{\Bbb E}({\bf X}{\bf X}^T)-\operatorname{\Bbb E}({\bf X})\operatorname{\Bbb E}({\bf X})^T \\ &=\begin{bmatrix} \operatorname{\Bbb E}(X_1^2)-\operatorname{\Bbb E}(X_1)^2 & \operatorname{\Bbb E}(X_1X_2)-\operatorname{\Bbb E}(X_1)\operatorname{\Bbb E}(X_2) & \cdots & \operatorname{\Bbb E}(X_1X_n)-\operatorname{\Bbb E}(X_1)\operatorname{\Bbb E}(X_n) \\ \operatorname{\Bbb E}(X_2X_1)-\operatorname{\Bbb E}(X_2)\operatorname{\Bbb E}(X_1) & \operatorname{\Bbb E}(X_2^2)-\operatorname{\Bbb E}(X_2)^2 & \cdots & \operatorname{\Bbb E}(X_2X_n)-\operatorname{\Bbb E}(X_2)\operatorname{\Bbb E}(X_n) \\ \vdots & \vdots & & \vdots \\ \operatorname{\Bbb E}(X_nX_1)-\operatorname{\Bbb E}(X_n)\operatorname{\Bbb E}(X_1) & \operatorname{\Bbb E}(X_nX_2)-\operatorname{\Bbb E}(X_n)\operatorname{\Bbb E}(X_2) & \cdots & \operatorname{\Bbb E}(X_n^2)-\operatorname{\Bbb E}(X_n)^2 \end{bmatrix} \\ &=\begin{bmatrix} \operatorname{Var}(X_1) & \operatorname{Cov}(X_1,X_2) & \cdots & \operatorname{Cov}(X_1,X_n) \\ \operatorname{Cov}(X_2,X_1) & \operatorname{Var}(X_2) & \cdots & \operatorname{Cov}(X_2,X_n) \\ \vdots & \vdots & & \vdots \\ \operatorname{Cov}(X_n,X_1) & \operatorname{Cov}(X_n,X_2) & \cdots & \operatorname{Var}(X_n) \end{bmatrix}. \end{align*} This symmetric matrix, the variance-covariance matrix of ${\bf X}$, preserves all the information.
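The whole derivation can also be checked numerically. Here is a hedged NumPy sketch (the mixing matrix `A` is an arbitrary choice, just to introduce correlation) verifying that $\operatorname{\Bbb E}({\bf X}{\bf X}^T)-\operatorname{\Bbb E}({\bf X})\operatorname{\Bbb E}({\bf X})^T$, computed from sample averages, matches the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
# n draws of a 3-dimensional random vector with correlated components;
# A is a made-up mixing matrix used only to create correlation
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
X = rng.normal(size=(n, 3)) @ A.T       # rows are realizations of X

E_XXt = X.T @ X / n                     # empirical E[X X^T]
E_X = X.mean(axis=0)                    # empirical E[X]
cov = E_XXt - np.outer(E_X, E_X)        # E[X X^T] - E[X] E[X]^T

# For the empirical distribution this identity is exact, so it agrees with
# the (biased, divisor-n) sample covariance up to floating-point error:
print(np.allclose(cov, np.cov(X.T, bias=True)))  # -> True
```

Note that `bias=True` makes `np.cov` divide by $n$ rather than $n-1$, matching the plain sample averages used above.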