How to calculate $\frac{\partial^2}{\partial \theta^2} \bigg(\frac{1}{n} \sum_{i=1}^{n} \exp(-Y_i (\theta^T X_i))\bigg)$?


I have a learning sample $D_n = (X_i, Y_i)_{i=1}^n$ where the $X_i$'s are $\mathbb{R}^d$-valued and the $Y_i$'s are $\{-1, 1\}$-valued.

$$ f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \exp(-Y_i (\theta^T X_i)) $$

where $\theta \in [-B, +B]^d$.

I want to calculate the Hessian matrix: $\nabla^2 f(\theta)$.


2 Answers

BEST ANSWER

Let's focus on computing $\frac{\partial^2}{\partial \theta_j^2}f(\theta)$.

First, bring the differential operator inside the sum by linearity. The only contribution of $\theta_j$ to the exponent of the $i$th term is $-Y_i \theta_j (X_i)_j$, so each differentiation with respect to $\theta_j$ multiplies that term by $-Y_i (X_i)_j$.

Here $(X_i)_j$ denotes the $j$th component of $X_i$.

Hence
$$\frac{\partial^2}{\partial\theta_j^2}f(\theta) = \frac{1}{n}\sum_{i=1}^n Y_i^2 (X_i)_j^2 \exp(-Y_i(\theta^T X_i))$$

Similarly, for the mixed partials,
$$\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}f(\theta) = \frac{1}{n}\sum_{i=1}^n Y_i^2 (X_i)_j (X_i)_k \exp(-Y_i(\theta^T X_i))$$

Since $Y_i \in \{-1,1\}$, the factor $Y_i^2 = 1$ and may be dropped.
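As a quick sanity check (not part of the original answer), these entrywise formulas can be compared against a finite-difference approximation of the Hessian. The NumPy sketch below is illustrative only: the sample sizes `n`, `d`, the random data, and the helper name `hess_fd` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))          # rows are the X_i
y = rng.choice([-1.0, 1.0], size=n)  # the Y_i labels
theta = rng.normal(size=d)

def f(t):
    """f(theta) = (1/n) sum_i exp(-Y_i theta^T X_i)."""
    return np.mean(np.exp(-y * (X @ t)))

# Analytic Hessian: H_jk = (1/n) sum_i Y_i^2 (X_i)_j (X_i)_k exp(-Y_i theta^T X_i).
# Y_i^2 = 1 since Y_i is +/-1, so it drops out of the weights below.
w = np.exp(-y * (X @ theta)) / n
H_analytic = X.T @ (w[:, None] * X)

def hess_fd(j, k, h=1e-4):
    """Central finite-difference estimate of the (j, k) Hessian entry."""
    ej, ek = np.eye(d)[j], np.eye(d)[k]
    return (f(theta + h*ej + h*ek) - f(theta + h*ej - h*ek)
            - f(theta - h*ej + h*ek) + f(theta - h*ej - h*ek)) / (4 * h * h)

H_fd = np.array([[hess_fd(j, k) for k in range(d)] for j in range(d)])
print(np.max(np.abs(H_analytic - H_fd)))  # small (finite-difference error)
```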

ANOTHER ANSWER

Let's employ a convention where uppercase Latin letters are matrices, lowercase Latin letters are vectors, and lowercase Greek letters are scalars. In this convention, the $(X,Y,\theta,f)$ variables in the problem become $(X,y,w,\phi)$ with dimensions
$$X\in{\mathbb R}^{d\times n},\quad y\in{\mathbb R}^{n\times 1},\quad w\in{\mathbb R}^{d\times 1},\quad \phi\in{\mathbb R}$$

Let's also define some additional variables which will prove convenient.
$$\eqalign{ &b=X^Tw,\quad a=y\odot b=Yb,\quad &e=\tfrac{1}{n}\exp(-a) \cr &Y={\rm Diag}(y),\quad E={\rm Diag}(e),\quad &Z=XY\in{\mathbb R}^{d\times n} \cr &y,a,b,e \in {\mathbb R}^{n\times 1},\quad &Y,E \in {\mathbb R}^{n\times n} \cr }$$

Write the function and find its gradient in terms of these new variables.
$$\eqalign{ \phi &= 1:e \cr d\phi &= 1:de = 1:(-e\odot da) = -e:da \cr &= -e:Y^TX^Tdw = -Ze:dw \cr g=\frac{\partial\phi}{\partial w} &= -Ze \cr }$$

So that's the gradient vector. To find the Hessian matrix, take the differential of $g$.
$$\eqalign{ dg &= -Z\,de = -Z(-e\odot da) = +ZE\,da \cr &= ZE(Y^TX^T\,dw) = ZEZ^Tdw \cr H = \frac{\partial g}{\partial w} &= ZEZ^T \cr }$$

In the above steps, the $\exp()$ function is applied elementwise, $1$ denotes the vector with all elements equal to one, $\odot$ denotes the elementwise (Hadamard) product, and $:$ denotes the trace (Frobenius) product, i.e. $A:B = {\rm Tr}(A^TB)$.

Note that since $Y$ and $E$ are diagonal matrices, they are also symmetric, and the symmetry of the latter ensures that the Hessian matrix is symmetric, as it must be.
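The identity $H = ZEZ^T$ can be checked numerically against the summation form of the Hessian. In this NumPy sketch the dimensions and random data are arbitrary assumptions; the columns of `X` play the role of the $X_i$, matching the $d\times n$ convention above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 3
X = rng.normal(size=(d, n))           # columns are the X_i
y = rng.choice([-1.0, 1.0], size=n)
w = rng.normal(size=d)

a = y * (X.T @ w)                     # a = Y X^T w
e = np.exp(-a) / n                    # e = (1/n) exp(-a)
Z = X * y                             # Z = X Y (scale column i by y_i)
H = Z @ np.diag(e) @ Z.T              # H = Z E Z^T

# Direct computation from the componentwise sum (Y_i^2 = 1):
# H = (1/n) sum_i exp(-Y_i w^T X_i) X_i X_i^T
H_direct = sum(e[i] * np.outer(X[:, i], X[:, i]) for i in range(n))
print(np.allclose(H, H_direct))       # True
```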

Since $E$ comes from the $\exp()$ function, its diagonal entries are strictly positive, so it is positive definite and has a unique positive-definite matrix square root. The Hessian can therefore be factored as $H = (Z\sqrt{E})\,(Z\sqrt{E})^T$, which exhibits $H$ as a Gram matrix and shows it is positive semidefinite.
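The factorization argument can also be illustrated in NumPy: since $E$ is diagonal with positive entries, $\sqrt{E}$ is just the elementwise square root of its diagonal, and the resulting Gram matrix has nonnegative eigenvalues. The data and dimensions below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 3
X = rng.normal(size=(d, n))           # columns are the X_i
y = rng.choice([-1.0, 1.0], size=n)
w = rng.normal(size=d)

e = np.exp(-y * (X.T @ w)) / n        # diagonal of E (all entries > 0)
Z = X * y                             # Z = X Y
M = Z * np.sqrt(e)                    # M = Z sqrt(E): columns scaled by sqrt(e_i)
H = M @ M.T                           # Gram matrix => symmetric PSD

eigvals = np.linalg.eigvalsh(H)
print(eigvals.min() >= -1e-12)        # True: all eigenvalues nonnegative
```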