How do you differentiate a product of vectors (that gives a scalar) by a vector?


I'm trying to understand the derivation of the least-squares method in matrix terms: $$S(\beta) = y^Ty - 2 \beta^T X^Ty + \beta ^ T X^TX \beta,$$ where $\beta$ is an $m \times 1$ column vector, $X$ is an $n \times m$ matrix, and $y$ is an $n \times 1$ vector. The question is: why is $$\frac{d(2\beta^T X^Ty)}{d \beta} = 2X^Ty?$$ I tried to derive it directly from the definition of the derivative: $$\frac{d(2\beta^T X^Ty)}{d \beta} = \lim_{\Delta \beta \to 0} \frac{2\Delta\beta^T X^T y}{\Delta \beta} = \lim_{\Delta \beta \to 0} 2\Delta\beta^T X^T y \cdot \Delta \beta^{-1}.$$ Maybe the last expression should instead be $$2\Delta\beta^T \Delta \beta^{-1} X^T y,$$ but either way I don't understand it. And what is $\Delta \beta^{-1}$? Vectors don't have inverses.

I have the same question about this equation: $$(\beta ^ T X^TX \beta)' = 2 X^T X \beta.$$


4 Answers

BEST ANSWER

There are two approaches to taking vector derivatives. First, you can work in coordinates. This always works, but is not always pleasant. In this case $$S(\beta) = y^Ty - 2\sum_i \beta_i(X^Ty)_i + \sum_{i,j} \beta_i (X^TX)_{ij} \beta_j,$$ so \begin{align*} \frac{\partial S}{\partial \beta_k} &= -2\sum_i \delta_{ik}(X^Ty)_i + \sum_{i,j} \delta_{ik}(X^TX)_{ij}\beta_j + \sum_{i,j} \beta_i(X^TX)_{ij}\delta_{jk}\\ &= -2(X^Ty)_ k + \sum_j (X^TX)_{kj}\beta_j + \sum_i \beta_i(X^TX)_{ik}, \end{align*} and since $X^TX$ is symmetric the last two sums are equal, giving $$\frac{\partial S}{\partial \beta} = -2X^Ty + 2(X^TX)\beta.$$
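The coordinate computation can be sanity-checked numerically. The sketch below (not part of the original derivation; the values of $X$, $y$, $\beta$ are made-up illustrations) compares the closed-form gradient $-2X^Ty + 2X^TX\beta$ against central finite differences of $S(\beta) = \|y - X\beta\|^2$ in plain Python:

```python
# Check the closed-form gradient  dS/dbeta = -2 X^T y + 2 X^T X beta
# against central finite differences of S(beta) = ||y - X beta||^2.
# X, y, beta below are illustrative made-up values.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def S(X, y, beta):
    r = [yi - pi for yi, pi in zip(y, matvec(X, beta))]
    return sum(ri * ri for ri in r)

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # n = 3, m = 2
y = [1.0, 0.0, 2.0]
beta = [0.5, -1.0]

Xt = transpose(X)
Xty = matvec(Xt, y)
XtXb = matvec(Xt, matvec(X, beta))
grad = [-2 * a + 2 * b for a, b in zip(Xty, XtXb)]   # -2 X^T y + 2 X^T X beta

# Central differences are exact for a quadratic, up to rounding error.
h = 1e-6
for k in range(len(beta)):
    bp, bm = beta[:], beta[:]
    bp[k] += h
    bm[k] -= h
    fd = (S(X, y, bp) - S(X, y, bm)) / (2 * h)
    assert abs(fd - grad[k]) < 1e-4
```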

The second approach is to work with the differential $dS(\beta)[\delta \beta]$, which computes the directional derivative $\frac{d}{dt}S(\beta + t\delta \beta)\Big\vert_{t=0}$; since the directional derivative is linear in $\delta\beta$ you must have $$dS(\beta)[\delta \beta] = \left(\frac{\partial S}{\partial \beta}\right)^T\delta \beta,$$ and so you can often recover an elegant, coordinate-free expression for the derivative from the differential. I wrote up some notes on this here: https://www.dropbox.com/s/7bj966ifgqiljmt/calculus.pdf?dl=0

In this case \begin{align*} dS(\beta)[\delta \beta] &= -2\delta \beta^TX^Ty + \delta \beta^TX^TX\beta + \beta^TX^TX\delta \beta\\ &= \left[-2y^TX +2\beta^TX^TX\right]\delta \beta. \end{align*}
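One way to check the differential formula numerically (a sketch with made-up values, not part of the original answer): compare $[-2y^TX + 2\beta^TX^TX]\delta\beta$ with a central-difference approximation of $\frac{d}{dt}S(\beta + t\,\delta\beta)\big\vert_{t=0}$:

```python
# Check  dS(beta)[db] = [-2 y^T X + 2 beta^T X^T X] db  against the
# directional derivative d/dt S(beta + t*db) at t = 0.
# X, y, beta, db are illustrative made-up values.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def S(X, y, beta):
    r = [yi - pi for yi, pi in zip(y, matvec(X, beta))]
    return sum(ri * ri for ri in r)

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [1.0, 0.0, 2.0]
beta = [0.5, -1.0]
db = [0.2, 0.1]                       # the direction delta-beta

# Closed form: the row vector -2 y^T X + 2 beta^T X^T X, applied to db.
Xb = matvec(X, beta)
row = [sum((-2 * y[j] + 2 * Xb[j]) * X[j][k] for j in range(len(y)))
       for k in range(len(beta))]
closed = sum(r * d for r, d in zip(row, db))

# Central difference in t: exact for a quadratic, up to rounding.
t = 1e-6
bp = [b + t * d for b, d in zip(beta, db)]
bm = [b - t * d for b, d in zip(beta, db)]
numeric = (S(X, y, bp) - S(X, y, bm)) / (2 * t)

assert abs(numeric - closed) < 1e-4
```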


What is meant by the vector derivative $\frac{dF}{d\beta}$ is the vector with components $\frac{dF}{d\beta_k}$. Then $$\frac{d}{d\beta_k}2\beta^T X^T y=\frac{d}{d\beta_k}\sum_{i,j}2\beta_i X_{ji} y_j=\sum_{i,j}2\delta_{ik} X_{ji} y_j=\sum_{j}2 X_{jk} y_j=(2X^T y)_k,$$ so indeed $\frac{d}{d\beta}(2\beta^T X^T y)=2 X^T y$ as desired.
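The index computation is easy to verify numerically. A minimal sketch in plain Python, with made-up $X$ and $y$:

```python
# Verify componentwise that d/dbeta_k (2 beta^T X^T y) = (2 X^T y)_k,
# using the same double sum  2 * sum_{i,j} beta_i X_{ji} y_j  as above.
# X and y are illustrative made-up values.

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # n = 3 rows, m = 2 columns
y = [1.0, 0.0, 2.0]
n, m = len(X), len(X[0])

# (X^T y)_k = sum_j X_{jk} y_j
Xty = [sum(X[j][k] * y[j] for j in range(n)) for k in range(m)]

def f(beta):
    return 2 * sum(beta[i] * X[j][i] * y[j] for i in range(m) for j in range(n))

# f is linear in beta, so even a one-sided difference quotient is
# essentially exact for each component.
beta = [0.5, -1.0]
h = 1e-6
for k in range(m):
    bp = beta[:]
    bp[k] += h
    assert abs((f(bp) - f(beta)) / h - 2 * Xty[k]) < 1e-4
```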


Let scalar field $f : \mathbb{R}^n \to \mathbb{R}$ be defined by

$$f (\mathrm x) = \mathrm a^T \mathrm x = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n$$

Taking the $n$ partial derivatives,

$$\partial_1 f (\mathrm x) = a_1, \qquad \partial_2 f (\mathrm x) = a_2, \qquad \dots, \qquad \partial_n f (\mathrm x) = a_n.$$

Hence, the gradient of $f$ is

$$\nabla f (\mathrm x) = (a_1, a_2, \dots, a_n) = \mathrm a$$
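Since $f$ is linear, a difference quotient recovers each $a_k$ essentially exactly. A minimal check, assuming arbitrary made-up values for $\mathrm a$ and $\mathrm x$:

```python
# For f(x) = a^T x, each partial derivative is just a_k.
# a and x are illustrative made-up values.

a = [2.0, -1.0, 3.0]

def f(x):
    return sum(ai * xi for ai, xi in zip(a, x))

x = [0.3, 0.7, -0.2]
h = 1e-6
grad = [(f([xi + h if i == k else xi for i, xi in enumerate(x)]) - f(x)) / h
        for k in range(len(x))]

# The numerical gradient matches a up to rounding error.
for g, ak in zip(grad, a):
    assert abs(g - ak) < 1e-4
```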


Recall that the multiple regression linear model is the equation given by $$Y_i = \beta_0 + \sum_{j=1}^{p}X_{ij}\beta_{j} + \epsilon_i\text{, } i = 1, 2, \dots, N\tag{1}$$ where $\epsilon_i$ is a random variable for each $i$. This can be written in matrix form like so. \begin{equation*} \begin{array}{c@{}c@{}c@{}c@{}c@{}c} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{bmatrix} &{}={} &\begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ 1 & X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & X_{N1} & X_{N2} & \cdots & X_{Np} \end{bmatrix} &\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} &{}+{} &\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix} \\ \\[0.1ex] \mathbf{y} &{}={} &\mathbf{X} &\boldsymbol{\beta} &{}+{} &\boldsymbol{\epsilon}\text{.} \end{array} \end{equation*} Since $\boldsymbol{\epsilon}$ is a vector of random variables, note that we call $\boldsymbol{\epsilon}$ a random vector. Our aim is to find an estimate for $\boldsymbol{\beta}$. One way to do this would be by minimizing the residual sum of squares, or $\text{RSS}$, defined by $$\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^{N}\left(y_i - \sum_{j=0}^{p}x_{ij}\beta_{j}\right)^2$$ where we have defined $x_{i0} = 1$ for all $i$. (These are merely the entries of the first column of the matrix $\mathbf{X}$.) Notice here we are using lowercase letters, to indicate that we are working with observed values from data. To minimize this, let us find the critical values for the components of $\boldsymbol{\beta}$. 
For $k = 0, 1, \dots, p$, notice that $$\dfrac{\partial\text{RSS}}{\partial\beta_k}(\boldsymbol{\beta}) = \sum_{i=1}^{N}2\left(y_i - \sum_{j=0}^{p}x_{ij}\beta_{j}\right)(-x_{ik}) = -2\sum_{i=1}^{N}\left(y_ix_{ik} - \sum_{j=0}^{p}x_{ij}x_{ik}\beta_{j}\right)\text{.}$$ Setting this equal to $0$, we get what are known as the normal equations: $$\sum_{i=1}^{N}y_ix_{ik} = \sum_{i=1}^{N}\sum_{j=0}^{p}x_{ij}x_{ik}\beta_{j}\text{.}\tag{2}$$ for $k = 0, 1, \dots, p$. This can be represented in matrix notation. For $k = 0, 1, \dots, p$, $$\begin{align*} \sum_{i=1}^{N}y_ix_{ik} &= \begin{bmatrix} x_{1k} & x_{2k} & \cdots & x_{Nk} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{N} \end{bmatrix} = \mathbf{c}_{k+1}^{T}\mathbf{y} \end{align*}$$ where $\mathbf{c}_\ell$ denotes the $\ell$th column of $\mathbf{X}$, $\ell = 1, \dots, p+1$. We can then represent each equation for $k = 0, 1, \dots, p$ as a matrix. Then $$\begin{bmatrix} \mathbf{c}_{1}^{T}\mathbf{y} \\ \mathbf{c}_{2}^{T}\mathbf{y} \\ \vdots \\ \mathbf{c}_{p+1}^{T}\mathbf{y} \end{bmatrix} = \begin{bmatrix} \mathbf{c}_{1}^{T} \\ \mathbf{c}_{2}^{T} \\ \vdots \\ \mathbf{c}_{p+1}^{T} \end{bmatrix}\mathbf{y} = \begin{bmatrix} \mathbf{c}_{1} & \mathbf{c}_{2} & \cdots & \mathbf{c}_{p+1} \end{bmatrix}^{T}\mathbf{y} = \mathbf{X}^{T}\mathbf{y}\text{.} $$ For justification of "factoring out" $\mathbf{y}$, see this link, on page 2. 
For the right-hand side of $(2)$ ($k = 0, 1, \dots, p$), $$\begin{align*} \sum_{i=1}^{N}\sum_{j=0}^{p}x_{ij}x_{ik}\beta_{j} &= \sum_{j=0}^{p}\left(\sum_{i=1}^{N}x_{ij}x_{ik}\right)\beta_{j} \\ &= \sum_{j=0}^{p}\left(\sum_{i=1}^{N}x_{ik}x_{ij}\right)\beta_{j} \\ &=\sum_{j=0}^{p}\begin{bmatrix} x_{1k} & x_{2k} & \cdots & x_{Nk} \end{bmatrix} \begin{bmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{Nj} \end{bmatrix}\beta_j \\ &= \sum_{j=0}^{p}\mathbf{c}^{T}_{k+1}\mathbf{c}_{j+1}\beta_j \\ &= \mathbf{c}^{T}_{k+1}\sum_{j=0}^{p}\mathbf{c}_{j+1}\beta_j \\ &= \mathbf{c}^{T}_{k+1}\begin{bmatrix} \mathbf{c}_{1} & \mathbf{c}_2 & \cdots & \mathbf{c}_{p+1} \end{bmatrix}\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \\ &= \mathbf{c}^{T}_{k+1}\mathbf{X}\boldsymbol{\beta}\text{.} \end{align*} $$ Bringing each case into a single matrix, we have $$\begin{bmatrix} \mathbf{c}^{T}_{1}\mathbf{X}\boldsymbol{\beta}\\ \mathbf{c}^{T}_{2}\mathbf{X}\boldsymbol{\beta}\\ \vdots \\ \mathbf{c}^{T}_{p+1}\mathbf{X}\boldsymbol{\beta}\\ \end{bmatrix} = \begin{bmatrix} \mathbf{c}^{T}_{1}\\ \mathbf{c}^{T}_{2}\\ \vdots \\ \mathbf{c}^{T}_{p+1}\\ \end{bmatrix}\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^{T}\mathbf{X}\boldsymbol{\beta}\text{.}$$ Thus, in matrix form, we have the normal equations as $$\mathbf{X}^{T}\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^{T}\mathbf{y}\text{.}\tag{3}$$
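To see the normal equations in action: the sketch below (illustrative made-up data, plain Python with a small Gaussian-elimination solver) solves $\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$ and checks that the residual $\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ is orthogonal to every column of $\mathbf{X}$, which is exactly what $(2)$ asserts:

```python
# Solve the normal equations  X^T X beta = X^T y  for a tiny data set,
# then check that the residual y - X beta_hat is orthogonal to every
# column of X.  The data are made-up values (y = 1 + 2x exactly).

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    """Gaussian elimination with partial pivoting on a copy of [A | b]."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # intercept column + one predictor
y = [1.0, 3.0, 5.0, 7.0]

Xt = transpose(X)
beta_hat = solve(matmul(Xt, X), matvec(Xt, y))        # should be close to [1, 2]

residual = [yi - pi for yi, pi in zip(y, matvec(X, beta_hat))]
for col in Xt:                                        # rows of X^T = columns of X
    assert abs(sum(c * r for c, r in zip(col, residual))) < 1e-9
```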