Proof that the multivariate regression plane maximizes correlation for normals


I am working through a homework sheet as practice for an upcoming course in multivariate statistics and have been stuck on the following problem:

Let $(Y,X_1,X_2,\ldots,X_p)^T\stackrel{d}{=}\mathcal{N}_{p+1}(0,\mathbf{C})$. Prove that $\mathbb{E}(Y|X_1,X_2,\ldots,X_p)$ maximizes $\operatorname{Corr}(Y,f(X_1,X_2,\ldots,X_p))$ over the space of linear functions $f$.

My attempt so far:

Proof that the conditional expectation is linear, and calculation of its value:

We omit the linearly degenerate cases as trivial. As a starting point, we prove that the mean squared error is minimized by a linear combination of the $X_i$, and calculate its coefficients. Partition $\mathbf{C}$ and use the well-known block-matrix inversion identity, writing $\operatorname{Var} Y=:\sigma^2$, $\operatorname{Var} X_i=\sigma_i^2=c_{ii}$, $\operatorname{Cov}(X_i,Y)=c_i$ and $\operatorname{Cov}(X_i,X_j)=c_{ij}$: \begin{equation} \mathbf{C}=\begin{pmatrix} \sigma^2&\mathbf{c}^T\\ \mathbf{c}&\mathbf{D}\\ \end{pmatrix}\qquad \mathbf{C}^{-1}=\begin{pmatrix} \sigma^2&\mathbf{c}^T\\ \mathbf{c}&\mathbf{D}\\ \end{pmatrix}^{-1}= \begin{pmatrix} (\sigma^2-\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c})^{-1}& -\sigma^{-2}\mathbf{c}^T(\mathbf{D}-\sigma^{-2}\mathbf{c}\mathbf{c}^T)^{-1}\\ -\sigma^{-2}(\mathbf{D}-\sigma^{-2}\mathbf{c}\mathbf{c}^T)^{-1}\mathbf{c}& (\mathbf{D}-\sigma^{-2}\mathbf{c}\mathbf{c}^T)^{-1}\\ \end{pmatrix}=:\begin{pmatrix} \alpha^2&\beta^T\\ \beta&\delta\\ \end{pmatrix} \end{equation} under the extra assumption $\det\mathbf{C}\,\det\mathbf{D}\neq 0$. Also note that $\operatorname{Var}(\mathbf{a}^T\mathbf{X})=\mathbf{a}^T\mathbf{D}\mathbf{a}$. 
Thus their joint density function is: \begin{equation} f(y,\mathbf{x})=\frac{1}{\sqrt{(2\pi)^{p+1}\det\mathbf{C} }}\exp\left(-\frac{1}{2}(y,\mathbf{x}^T)\mathbf{C}^{-1}(y,\mathbf{x}^T)^T\right)=\frac{1}{\sqrt{(2\pi)^{p+1}\det\mathbf{C} }}\exp\left(-\frac{\alpha^2y^2+2\beta^T\mathbf{x}y+\mathbf{x}^T\delta\mathbf{x}}{2}\right)= \end{equation} \begin{equation} =\frac{1}{\sqrt{(2\pi)^{p+1}\det\mathbf{C} }}\exp\left(-\frac{1}{2}\left(\alpha y+\frac{\mathbf{x}^T\beta}{\alpha}\right)^2-\frac{\mathbf{x}^T\delta\mathbf{x}}{2}+\frac{(\mathbf{x}^T\beta)^2}{2\alpha^2}\right) \end{equation} We know from the notes that $\mathbb{E}(Y-g(X_1,\ldots,X_p))^2$ is minimized over all measurable $g$ by: \begin{equation} \mathbb{E}(Y|X_1,\ldots, X_p)=\frac{\displaystyle \int_{-\infty}^{\infty}yf(y,\mathbf{x})\,\mathrm{d}{y}}{\displaystyle \int_{-\infty}^{\infty}f(y,\mathbf{x})\,\mathrm{d}{y}}=-\frac{\mathbf{x}^T\beta}{\alpha^2}= \boxed{\mathbf{x}^T\sigma^{-2}(\mathbf{D}-\sigma^{-2}\mathbf{c}\mathbf{c}^T)^{-1}\mathbf{c}\,(\sigma^2-\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c})=\mathbf{x}^T\mathbf{D}^{-1}\mathbf{c}} \end{equation} The denominator: \begin{equation} \int_{-\infty}^{\infty}\frac{1}{\sqrt{(2\pi)^{p+1}\det\mathbf{C} }}\exp\left(-\frac{1}{2}\left(\alpha y+\frac{\mathbf{x}^T\beta}{\alpha}\right)^2-\frac{\mathbf{x}^T\delta\mathbf{x}}{2}+\frac{(\mathbf{x}^T\beta)^2}{2\alpha^2}\right)\,\mathrm{d}{y}=\frac{1}{\sqrt{(2\pi)^{p}\det\mathbf{C} }\,\alpha}\exp\left(-\frac{\mathbf{x}^T\delta\mathbf{x}}{2}+\frac{(\mathbf{x}^T\beta)^2}{2\alpha^2}\right) \end{equation} The numerator is, up to the same factor, the mean of a normal random variable with mean $\displaystyle-\frac{\mathbf{x}^T\beta}{\alpha^2}$: \begin{equation} \int_{-\infty}^{\infty}\frac{y}{\sqrt{(2\pi)^{p+1}\det\mathbf{C}}}\exp\left(-\frac{1}{2}\left(\alpha y+\frac{\mathbf{x}^T\beta}{\alpha}\right)^2-\frac{\mathbf{x}^T\delta\mathbf{x}}{2}+\frac{(\mathbf{x}^T\beta)^2}{2\alpha^2}\right)\,\mathrm{d}{y} =-\frac{\mathbf{x}^T\beta}{\sqrt{(2\pi)^{p}\det\mathbf{C} }\,\alpha^3}\exp\left(-\frac{\mathbf{x}^T\delta\mathbf{x}}{2}+\frac{(\mathbf{x}^T\beta)^2}{2\alpha^2}\right) \end{equation} which proves it is linear. We now have to minimize $\displaystyle \mathbb{E}\left(Y-b-\sum_{i=1}^{p}a_iX_i\right)^2$. The integrand comes from the exponential family, so we can switch the order of integration and differentiation: \begin{equation} \mathbb{E}\left(\frac{\partial}{\partial a_i}\left(Y-b-\sum_{j=1}^{p}a_jX_j\right)^2\right)=-2c_i+2\sum_{j=1}^{p}a_jc_{ij}\qquad \mathbb{E}\left(\frac{\partial}{\partial b}\left(Y-b-\sum_{j=1}^{p}a_jX_j\right)^2\right)=2b \end{equation} \begin{equation} \mathbb{E}\left(\frac{\partial^2}{\partial a_i^2}\left(Y-b-\sum_{j=1}^{p}a_jX_j\right)^2\right)=2c_{ii}\qquad \mathbb{E}\left(\frac{\partial^2}{\partial b^2}\left(Y-b-\sum_{j=1}^{p}a_jX_j\right)^2\right)=2 \end{equation} \begin{equation} \mathbb{E}\left(\frac{\partial^2}{\partial a_i \partial a_j}\left(Y-b-\sum_{j=1}^{p}a_jX_j\right)^2\right)=2c_{ij}\qquad \mathbb{E}\left(\frac{\partial^2}{\partial a_i \partial b}\left(Y-b-\sum_{j=1}^{p}a_jX_j\right)^2\right)=0 \end{equation} So the Hessian is positive definite (it is twice the block-diagonal matrix with blocks $\mathbf{D}$ and $1$), meaning that the minimum is attained when: \begin{equation} \mathbf{D}\mathbf{a}=\mathbf{c}\Rightarrow \boxed{\mathbf{a}=\mathbf{D}^{-1}\mathbf{c}\qquad b=0} \end{equation}
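A quick numerical sanity check of the boxed coefficients (an illustrative numpy sketch with a randomly generated covariance matrix; this is not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3

# Random positive-definite covariance for (Y, X_1, ..., X_p).
A = rng.normal(size=(p + 1, p + 1))
C = A @ A.T + (p + 1) * np.eye(p + 1)

sigma2 = C[0, 0]           # Var(Y)
c = C[1:, 0]               # c_i = Cov(X_i, Y)
D = C[1:, 1:]              # D_ij = Cov(X_i, X_j)

a = np.linalg.solve(D, c)  # claimed minimizer: a = D^{-1} c, b = 0

# Monte Carlo estimate of E(Y - b - a^T X)^2 at the claimed optimum
# and at random perturbations of it.
Z = rng.multivariate_normal(np.zeros(p + 1), C, size=200_000)
Y, X = Z[:, 0], Z[:, 1:]
mse_opt = np.mean((Y - X @ a) ** 2)

for _ in range(20):
    a_pert = a + 0.1 * rng.normal(size=p)
    b_pert = 0.1 * rng.normal()
    # up to Monte Carlo noise, no perturbation does better
    assert np.mean((Y - X @ a_pert - b_pert) ** 2) >= mse_opt - 1e-3
```

The minimal mean squared error should also agree with the residual variance $\sigma^2-\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}$ appearing in the block inversion.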

Proof that it is the maximum

We want to show a maximum in every direction $\mathbf{v}$, so take $t(\mathbf{X},u)=(\mathbf{D}^{-1}\mathbf{c}+u\mathbf{v})^T\mathbf{X}$; then, using $\mathbf{D}^T=\mathbf{D}$: \begin{equation} \mathrm{Corr}(Y,t(\mathbf{X},u))=\frac{\mathrm{Cov}(Y,(\mathbf{D}^{-1}\mathbf{c}+u\mathbf{v})^T\mathbf{X})}{\sqrt{\mathrm{Var}\,Y\,\mathrm{Var}((\mathbf{D}^{-1}\mathbf{c}+u\mathbf{v})^T\mathbf{X})}}= \frac{(\mathbf{D}^{-1}\mathbf{c}+ u\mathbf{v})^T\mathbf{c}} {\sigma\sqrt{\mathbf{c}^T\mathbf{D}^{-1}\mathbf{D}\mathbf{D}^{-1}\mathbf{c}+2u\mathbf{v}^T\mathbf{D}\mathbf{D}^{-1}\mathbf{c}+u^2\mathbf{v}^T\mathbf{D}\mathbf{v}}}= \end{equation} \begin{equation} =\frac{\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}+ u\mathbf{v}^T\mathbf{c}} {\sigma\sqrt{\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}+ 2u\mathbf{v}^T\mathbf{c}+ u^2\mathbf{v}^T\mathbf{D}\mathbf{v}}} \end{equation} \begin{equation} \frac{\partial}{\partial u}\mathrm{Corr}(Y,t(\mathbf{X},u))= \frac{u((\mathbf{c}^T\mathbf{v})^2-\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{v}^T\mathbf{D}\mathbf{v})} {\sigma(\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}+2u\mathbf{v}^T\mathbf{c}+u^2\mathbf{v}^T\mathbf{D}\mathbf{v})^{3/2}} \end{equation} With some straightforward but lengthy calculations, we get: \begin{equation} \left.\frac{\partial^2}{\partial u^2}\mathrm{Corr}(Y,t(\mathbf{X},u))\right|_{u=0}=-\frac{\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{v}^T\mathbf{D}\mathbf{v}-(\mathbf{c}^T\mathbf{v})^2}{\sigma(\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c})^{3/2}} \end{equation} So either $u=0$ is the only critical point, or $(\mathbf{v}^T\mathbf{c})^2-\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{v}^T\mathbf{D}\mathbf{v}=0$, in which case the derivative vanishes identically and the correlation is constant in the direction of $\mathbf{v}$. 
Let $\mathbf{v}=\mathbf{v}_\perp+\lambda\mathbf{c}$, where $\mathbf{v}_\perp$ is orthogonal to $\mathbf{c}$; thus: \begin{equation} \mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{v}^T\mathbf{D}\mathbf{v}-(\mathbf{v}^T\mathbf{c})^2=\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\left(\mathbf{v}_\perp^T\mathbf{D}\mathbf{v}_\perp+2\lambda\mathbf{c}^T\mathbf{D}\mathbf{v}_\perp+\lambda^2\mathbf{c}^T\mathbf{D}\mathbf{c}\right)-(\lambda\mathbf{c}^T\mathbf{c})^2 \end{equation} If $Y$ is independent of the $X_i$ (that is, $\mathbf{c}=\mathbf{0}$), then the correlation is $0$ for every linear $f$. If that's not the case, with scaling we can achieve $\mathbf{c}^T\mathbf{c}=1$, and I would need something like $\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c}\,\mathbf{c}^T\mathbf{D}\mathbf{c}\geq(\mathbf{c}^T\mathbf{c})^2$.
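For what it's worth, the quantity in question is controlled by the Cauchy–Schwarz inequality in the $\mathbf{D}$-inner product, $(\mathbf{v}^T\mathbf{c})^2\le(\mathbf{c}^T\mathbf{D}^{-1}\mathbf{c})(\mathbf{v}^T\mathbf{D}\mathbf{v})$. A numerical sketch of this and of the claimed maximum at $u=0$ (numpy, with randomly generated `D` and `c` standing in for the covariance blocks):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4

M = rng.normal(size=(p, p))
D = M @ M.T + p * np.eye(p)     # positive-definite stand-in for Cov(X)
c = rng.normal(size=p)          # stand-in for Cov(X_i, Y)

Dinv_c = np.linalg.solve(D, c)
A = c @ Dinv_c                  # c^T D^{-1} c

def corr_along(u, v):
    # Corr(Y, (D^{-1}c + u v)^T X) up to the constant factor 1/sigma
    a = Dinv_c + u * v
    return (a @ c) / np.sqrt(a @ D @ a)

for _ in range(50):
    v = rng.normal(size=p)
    B, E = v @ c, v @ D @ v
    # Cauchy-Schwarz in the D-inner product
    assert B ** 2 <= A * E + 1e-9
    # hence u = 0 maximizes the correlation along every direction v
    for u in (-1.0, -0.1, 0.1, 1.0):
        assert corr_along(u, v) <= corr_along(0.0, v) + 1e-12
```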

And there I'm just stuck, any ideas how to continue?



BEST ANSWER

Let's start with the univariate case:

$$X_1 \sim \mathcal{N}(\mu,\sigma^2)$$ with pdf $\phi(x)$ standing in for the currently specified Gaussian (I'm using it as shorthand for "the density of the variable in question"; it's not necessarily the standard normal density).

Now, let's find a linear predictor that minimizes the mean squared error:

$$f(x):=ax+b \implies MSE_f(a,b) := \int \phi(z)(f(z)-\mu)^2dz=\int \phi(z)(az+b-\mu)^2dz=$$ $$a^2\int z^2\phi(z)dz+2a(b-\mu)\int z\phi(z)dz+(b-\mu)^2\int \phi(z)dz =$$ $$a^2(\sigma^2+\mu^2)+2a(b-\mu)\mu +(b-\mu)^2 $$

This function is convex in $a,b$, so setting $\nabla_{a,b} MSE_f=\mathbf{0}$, we get:

$$\frac{\partial}{\partial a}MSE_f = 2a(\sigma^2+\mu^2)+2\mu(b-\mu)= 0 \implies a = \frac{\mu(\mu-b)}{\sigma^2+\mu^2}$$ $$\frac{\partial}{\partial b}MSE_f = 2a\mu+2(b-\mu) =0\implies b=\mu-a\mu$$

Sub in for $b$ in the first equation to get:

$$a(\sigma^2+\mu^2)=\mu(\mu-\mu+a\mu)=a\mu^2 \implies a=0 \implies b=\mu$$

So, if we were given $X_1$, we would minimize our squared error loss by choosing $f=E[X_1]=\mu$
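A small numerical check of the algebra above (an illustrative sketch; the values of `mu` and `sigma` are arbitrary):

```python
import numpy as np

mu, sigma = 2.0, 1.5

def mse(a, b):
    # E[(a*X1 + b - mu)^2] for X1 ~ N(mu, sigma^2), expanded as above
    return a ** 2 * (sigma ** 2 + mu ** 2) + 2 * a * (b - mu) * mu + (b - mu) ** 2

best = mse(0.0, mu)  # claimed minimum at a = 0, b = mu
assert best == 0.0

rng = np.random.default_rng(2)
for a, b in rng.normal(size=(100, 2)):
    assert mse(a, b) >= best
```

Note that `mse(1.0, 0.0)` recovers $\sigma^2$, the error of predicting $X_1$ with itself shifted to mean zero, as the expansion predicts.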

Now, we haven't gotten to correlation yet, but let's go to two variables, using your zero-mean multivariate Gaussian:

$$(X_1,X_2)\sim \mathcal{N}(\mathbf{0},\mathbf{C})$$

Now, we want to find a linear function $f(X_2)$ that is maximally correlated with $X_1$. First, let's find the least-squares solution:

$$\arg \min_{aX_2+b} \int (X_1-aX_2-b)^2\phi(X_1|X_2)\,dX_1$$

We don't need to go through all the tedious algebra here, since we know this integral is minimized when $aX_2+b = E[X_1|X_2]$, by extension from the univariate case.

If you've taken a basic linear models course, you'll know that with $f(X_2)=E[X_1|X_2]$ we can partition our variance as $Var(X_1)=Var(f(X_2)) + Var(X_1-f(X_2))$, which is just the variance of the regression function plus the variance of the residuals. However, we also know that the coefficient of determination is:

$$\rho^2 = \frac{Var(f(X_2))}{Var(f(X_2)) + Var(X_1-f(X_2))}$$

Since we minimized $Var(X_1-f(X_2))$ by setting $f(X_2)=E[X_1|X_2]$, we have maximized $\rho^2$. Since $x\mapsto\sqrt{x}$ is strictly increasing on $[0,\infty)$, we have also maximized $|\rho|$. In particular, we have also maximized $Var(f(X_2))$ s.t. $Var(f(X_2))\leq Var(X_1)$. We can show this is equivalent to maximizing the positive correlation between $f(X_2)$ and $X_1$:

$$Var(X_1)=Var(f(X_2)) + Var(X_1-f(X_2))=$$ $$Var(f(X_2))+Var(X_1)+Var(f(X_2))-2Cov(X_1,f(X_2))$$

Rearranging, combining, and dividing by two, we get:

$$0=Var(f(X_2))-Cov(X_1,f(X_2)) \implies Cov(X_1,f(X_2))=Var(f(X_2))$$

$$\rho(X_1,f(X_2)) = \frac{Cov(X_1,f(X_2))}{\sqrt{Var(f(X_2))Var(X_1)}}=\sqrt{\frac{Var(f(X_2))}{Var(X_1)}}$$

Since $Var(X_1)$ is fixed, and we have maximized $Var(f(X_2))$ s.t. $Var(f(X_2))\leq Var(X_1)$, we have maximized the correlation too.
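A Monte Carlo illustration of these identities for a bivariate normal (the covariance entries `c11`, `c12`, `c22` are arbitrary choices, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(3)
c11, c12, c22 = 2.0, 0.8, 1.5     # arbitrary, with c12**2 < c11*c22
C = np.array([[c11, c12], [c12, c22]])

Z = rng.multivariate_normal([0.0, 0.0], C, size=500_000)
X1, X2 = Z[:, 0], Z[:, 1]

f = (c12 / c22) * X2              # E[X1 | X2] for a zero-mean bivariate normal

# Var(X1) = Var(f) + Var(X1 - f), up to Monte Carlo noise
assert abs(np.var(X1) - (np.var(f) + np.var(X1 - f))) < 0.02
# Cov(X1, f) = Var(f)
assert abs(np.cov(X1, f)[0, 1] - np.var(f)) < 0.02
# rho(X1, f) = sqrt(Var(f) / Var(X1))
rho = np.corrcoef(X1, f)[0, 1]
assert abs(rho - np.sqrt(np.var(f) / np.var(X1))) < 0.01
```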


Extending to the multivariate case

Let's say we now have $X:=(X_1,X_2,X_3)\sim \mathcal{N}(\mathbf{0},\mathbf{C})$.

Since we know $X$ is multivariate Gaussian, we know its marginals are also Gaussian. Hence, $aX_2+bX_3+c$ is also normally distributed. Let's define $Z=aX_2+bX_3+c$; then $X'=(X_1,Z)$ is another multivariate Gaussian. Applying the logic from the two-variable case, we see that choosing $Z:=E[X_1|X_2,X_3]$ maximizes the correlation between $Z$ and $X_1$. The linearity of expectation and the closure of the Gaussian family under linear maps imply that $E[X_1|X_2,X_3]$ is itself a linear function of $X_2,X_3$.

So, by induction, you can extend this reasoning in pairwise fashion to an arbitrary number of dimensions.
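Concretely, in the three-variable case the claim can be spot-checked by Monte Carlo (the covariance matrix below is an arbitrary positive-definite choice, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(4)
C = np.array([[2.0, 0.7, 0.3],
              [0.7, 1.5, 0.4],
              [0.3, 0.4, 1.0]])   # arbitrary covariance for (X1, X2, X3)

Z = rng.multivariate_normal(np.zeros(3), C, size=300_000)
X1, X23 = Z[:, 0], Z[:, 1:]

# E[X1 | X2, X3] = w^T (X2, X3) with w = Cxx^{-1} c for the zero-mean case
w = np.linalg.solve(C[1:, 1:], C[1:, 0])
Zstar = X23 @ w

rho_star = np.corrcoef(X1, Zstar)[0, 1]
for _ in range(100):
    a, b, cst = rng.normal(size=3)
    rho = np.corrcoef(X1, a * Z[:, 1] + b * Z[:, 2] + cst)[0, 1]
    # no linear a*X2 + b*X3 + cst beats the conditional mean (up to noise)
    assert rho <= rho_star + 1e-3
```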


Let $(Y,X_1,X_2,\ldots,X_p)^T\stackrel{d}{=}\mathcal{N}_{p+1}(0,\mathbf{C})$. Prove that $\mathbb{E}(Y|X_1,X_2,\ldots,X_p)$ maximizes $\operatorname{Corr}(Y,f(X_1,X_2,\ldots,X_p))$ over the space of linear functions $f$.

Notation: The matrix $\mathbf{C}$ can be decomposed as \begin{equation*} \mathbf{C}=\begin{pmatrix} c_{yy}&\mathbf{c}^T_{yx}\\ \mathbf{c}_{yx}&\mathbf{C}_{xx}\\ \end{pmatrix} \end{equation*} Let $ \mathbf{X} = (X_1,X_2,\ldots,X_p)^T $; then \begin{equation*} \mathbb{V}\mbox{ar}[Y] =c_{yy}, \quad \mathbb{C}\mbox{ov}[Y, \mathbf{X}] = \mathbf{c}_{yx} \end{equation*} and from the properties of the multivariate normal distribution we have that $$\mathbb{E}(Y|X_1,X_2,\ldots,X_p ) = \mathbf{c}_{yx}^T \mathbf{C}_{xx}^{-1}\mathbf{X} $$

Proof: The space of linear functions $f$ is given by $$ f(X_1,X_2,\ldots,X_p) = \mathbf{a}^T \mathbf{X}+ b $$ where $ \mathbf{X} = (X_1,X_2,\ldots,X_p)^T $, $\mathbf{a} = (a_1,a_2,\ldots,a_p)^T \in \mathbb{R}^p$ and $b \in \mathbb{R}$. Recall that the correlation between two random variables $Y$ and $Z$ is given by $$ \mbox{Corr}(Y, Z) = \frac{ \mbox{Cov}(Y, Z) }{ \sqrt{ \mbox{Var}(Y) \mbox{Var}(Z)} } $$ If we let $ Z = \mathbf{a}^T \mathbf{X} + b $ we have $$ E[Z] = E[ \mathbf{a}^T \mathbf{X}+ b] = \mathbf{a}^T E[\mathbf{X}] + b = b $$ $$ \mathbb{V}\mbox{ar}[Z] = \mathbb{V}\mbox{ar}[ \mathbf{a}^T \mathbf{X}+ b] = \mathbf{a}^T \mathbb{V}\mbox{ar}[\mathbf{X}] \mathbf{a} = \mathbf{a}^T \mathbf{C}_{xx} \mathbf{a} $$ $$ \mathbb{C}\mbox{ov}[Y, Z] = \mathbb{C}\mbox{ov}[ Y, \mathbf{a}^T \mathbf{X}+ b] = \mathbb{C}\mbox{ov}[ Y, \mathbf{X} ] \mathbf{a} = \mathbf{c}_{yx}^T \mathbf{a} $$ and we are given that $\mathbb{V}\mbox{ar}[Y] = c_{yy}$, which is a constant with respect to $(\mathbf{a},b) $, so the maximizer of $$ \mathbb{C}\mbox{orr}(Y, Z) = \frac{ \mathbf{c}_{yx}^T \mathbf{a} }{ \sqrt{ c_{yy}\, \mathbf{a}^T \mathbf{C}_{xx} \mathbf{a} } } $$ is the same as that of $$ \max_{ (\mathbf{a},b) \in \mathbb{R}^{p+1} } \frac{ \mathbf{c}_{yx}^T \mathbf{a} }{ \sqrt{ \mathbf{a}^T \mathbf{C}_{xx} \mathbf{a} } } $$ Some thoughts,

  • The variance of $Y$, $c_{yy}$ is a constant with respect to $\mathbf{a} $.
  • This expression does not include $b$, so we can set it to anything. I will choose $b=0$, but $b=10$ works as well.
  • If $\mathbb{C}\mbox{ov}(Y, Z) = \mathbf{c}_{yx} = \mathbf{0}$ then any $\mathbf{a} \in \mathbb{R}^p$ will maximize the correlation, because the correlation equals $0$ for all $\mathbf{a} \in \mathbb{R}^p$.

Putting that aside, we can make this problem easier by applying the transformation $ \mathbf{v} = \mathbf{C}_{xx}^{1/2} \mathbf{a}$, where $\mathbf{C}_{xx}^{1/2}$ satisfies $\mathbf{C}_{xx}^{1/2} \mathbf{C}_{xx}^{1/2} = \mathbf{C}_{xx} $, which is possible because $ \mathbf{C}_{xx}$ is positive definite. In addition we let $\mathbf{u} =\mathbf{C}_{xx}^{-1/2} \mathbf{c}_{yx}$, and we have $$ \max_{ \mathbf{v} \in \mathbb{R}^{p} } \frac{ \mathbf{c}_{yx}^T\mathbf{C}_{xx}^{-1/2} \mathbf{v} }{ \sqrt{ \mathbf{v}^T \mathbf{v} } } = \max_{ \mathbf{v} \in \mathbb{R}^{p} } \frac{ \mathbf{u}^T \mathbf{v} }{ \sqrt{ \mathbf{v}^T \mathbf{v} } } $$ Now recall the Cauchy-Schwarz inequality, which, using the dot product as the inner product, states that for vectors $\mathbf{u}$ and $\mathbf{v}$ $$ | \mathbf{u}^T \mathbf{v} | \le \sqrt{ \mathbf{u}^T \mathbf{u} }\sqrt{ \mathbf{v}^T \mathbf{v} } \quad \quad \Rightarrow \quad \quad \frac{ | \mathbf{u}^T \mathbf{v} | }{\sqrt{ \mathbf{u}^T \mathbf{u} }\sqrt{ \mathbf{v}^T \mathbf{v} } } \le 1 $$ Since $\mathbf{u}$ is constant with respect to $\mathbf{v}$, achieving equality in the Cauchy-Schwarz inequality achieves our goal of maximizing the correlation. Equality in the Cauchy-Schwarz inequality occurs when $ \mathbf{v}$ and $\mathbf{u}$ are linearly dependent, i.e. when there exists a constant $\alpha>0$ such that $$ \mathbf{v} = \alpha \mathbf{u} \quad \Rightarrow \quad \mathbf{C}_{xx}^{1/2} \mathbf{a} = \alpha \mathbf{C}_{xx}^{-1/2} \mathbf{c}_{yx} \quad \Rightarrow \quad \mathbf{a} = \alpha \mathbf{C}_{xx}^{-1} \mathbf{c}_{yx} $$ and therefore, taking $\alpha =1 $, a linear function of $X_1,X_2,\ldots,X_p$ that maximizes the correlation with $Y$ is $$ \mathbf{a}^T \mathbf{X} = \mathbf{c}_{yx}^T \mathbf{C}_{xx}^{-1}\mathbf{X} = \mathbb{E}(Y|X_1,X_2,\ldots,X_p) $$ Some thoughts,

  • This is a scalar multiple of $\mathbb{E}(Y|X_1,X_2,\ldots,X_p)$; setting $\alpha =1$ gives the desired answer, but any $\alpha>0$ (e.g. $\alpha = 2$) also maximizes the correlation.

  • This can also be done using vector calculus, but you basically end up proving a version of the Cauchy-Schwarz inequality again.
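The Cauchy-Schwarz argument is easy to verify numerically; below is a sketch with randomly generated stand-ins for $\mathbf{C}_{xx}$ and $\mathbf{c}_{yx}$ (numpy, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
p = 4

M = rng.normal(size=(p, p))
Cxx = M @ M.T + p * np.eye(p)       # positive-definite Cov(X)
cyx = rng.normal(size=p)            # Cov(Y, X)

def g(a):
    # the reduced objective  c_yx^T a / sqrt(a^T Cxx a)
    return (cyx @ a) / np.sqrt(a @ Cxx @ a)

a_star = np.linalg.solve(Cxx, cyx)  # a = Cxx^{-1} c_yx  (alpha = 1)

for _ in range(200):
    a = rng.normal(size=p)
    assert g(a) <= g(a_star) + 1e-12
# any positive rescaling (e.g. alpha = 2) attains the same maximum
assert abs(g(2 * a_star) - g(a_star)) < 1e-12
```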