I'm currently considering a problem where we have data vectors $X_1,\dots,X_n\in\mathbb{R}^d$ with labels $y_i=\pm1$. I want to find the minimizer of $L(\theta)=\sum_i(1-y_iX_i^t\theta)^2$ by finding the derivative and setting it equal to $0$. My attempt so far is:
\begin{equation} \frac{\partial L(\theta)}{\partial \theta} = -2\sum_i(1-y_iX_i^t\theta)y_iX_i=0 \iff \sum_iy_iX_i=\sum_i(X_i^t\theta)\,X_i, \end{equation} where the last step uses $y_i^2=1$.
If this is correct, how do I solve for $\theta$ here? I know $\theta$ doesn't depend on $i$, but my understanding is that I can't pull it out of the sum since it's a vector. I have gone as far as calculating the Hessian matrix $H$:
\begin{equation}
\begin{aligned}
[H]_{jk}=\frac{\partial^2L(\theta)}{\partial\theta_j\partial\theta_k}
&= \frac{\partial}{\partial\theta_k}\frac{\partial L(\theta)}{\partial\theta_j}
= \frac{\partial}{\partial\theta_k}\left(-2\sum_i(y_iX_{ij}-X_{ij}X_i^t\theta)\right)\\
&= 2\sum_iX_{ij}\left(\frac{\partial}{\partial\theta_k}X_i^t\theta\right)
=2\sum_iX_{ij}X_{ik}
=2\,[X^tX]_{jk},
\end{aligned}
\end{equation}
i.e. $H=2X^tX$, where $X=[X_1,\dots,X_n]^t\in \mathbb{R}^{n\times d}$ (called the design matrix in statistics). This shows $H$ is PSD, so the $\theta$ I'm looking for is a minimizer. Thanks for your help.
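For what it's worth, here is a quick numerical sanity check of my gradient and Hessian formulas (just a sketch with random numpy data; the dimensions and seed are arbitrary):

```python
import numpy as np

# Random instance just for the check (dimensions are arbitrary)
rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))           # rows are X_i^t (design matrix)
y = rng.choice([-1.0, 1.0], size=n)   # labels in {-1, +1}
theta = rng.normal(size=d)

def L(t):
    return np.sum((1.0 - y * (X @ t)) ** 2)

# Analytic gradient: -2 * sum_i (1 - y_i X_i^t theta) y_i X_i
grad = -2.0 * X.T @ (y * (1.0 - y * (X @ theta)))

# Central finite differences agree with the analytic gradient
eps = 1e-6
fd = np.array([(L(theta + eps * e) - L(theta - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.allclose(grad, fd, atol=1e-4))          # True

# Hessian 2 X^t X is constant in theta and PSD
H = 2.0 * X.T @ X
print(np.all(np.linalg.eigvalsh(H) >= -1e-10))   # True
```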
$ \def\bbR#1{{\mathbb R}^{#1}} \def\o{{\tt1}}\def\p{\partial}\def\L{{\cal L}} \def\LR#1{\left(#1\right)} \def\vecc#1{\operatorname{vec}\LR{#1}} \def\diag#1{\operatorname{diag}\LR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR#1}} \def\m#1{\left[\begin{array}{r}#1\end{array}\right]} $Let's use a convention wherein lower/uppercase letters denote vectors/matrices (respectively) and define the following variables
$$\eqalign{
&X = \m{x_1&x_2&\ldots&x_n} &\qiq \;X\in\bbR{d\times n} \\
&\o = {\rm all\;ones\;vector} &\qiq \;\;\o\in\bbR{n} \\
&w = \theta &\qiq \;\,w\in\bbR{d} \\
&Y = \Diag y=Y^T &\qiq Y\o=y\in\bbR{n}\\
&v = \LR{YX^Tw - \o} &\qiq \;dv = YX^Tdw \\
}$$
Now the objective function can be written without any $\Sigma$ symbols, which makes the gradient calculation considerably easier:
$$\eqalign{
\L &= v^Tv \\
d\L &= 2v^T\c{dv} \;=\; 2v^T\CLR{YX^Tdw} \;=\; \LR{2XYv}^Tdw \\
\grad{\L}{w} &= 2XYv \;=\; 2\LR{XY^2X^Tw-Xy} \\
}$$
Set the gradient to zero and solve for the optimal $w$-vector (assuming $XY^2X^T$ is invertible, i.e. $X$ has full row rank):
$$XY^2X^Tw = Xy \qiq w = \LR{XY^2X^T}^{-1}Xy$$
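If you want to verify this numerically, here is a small sketch in numpy (random data, using the $d\times n$ convention for $X$ above, and assuming $XY^2X^T$ is invertible): the gradient $2XY\LR{YX^Tw-\o}$ does vanish at the closed-form $w$.

```python
import numpy as np

# Small numerical check of the closed form, with X in R^{d x n}
# (columns are the data points); data and dimensions are arbitrary.
rng = np.random.default_rng(1)
d, n = 4, 50
X = rng.normal(size=(d, n))
y = rng.choice([-1.0, 1.0], size=n)
Y = np.diag(y)

# w = (X Y^2 X^T)^{-1} X y, computed via a linear solve
w = np.linalg.solve(X @ Y @ Y @ X.T, X @ y)

# The gradient 2 X Y (Y X^T w - 1) should vanish at w
v = Y @ X.T @ w - np.ones(n)
print(np.allclose(2.0 * X @ Y @ v, 0.0, atol=1e-8))   # True
```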
Update
The elements $y_k=\pm1$ imply $Y^2=I,$ so the final result can be simplified to $$w = \LR{XX^T}^{-1}Xy$$ Also, the way I've defined $X$ makes it the transpose of the Design Matrix.
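As a final sanity check (again just a numpy sketch with random data), the simplified formula coincides with an ordinary least-squares fit of $y$ on the design matrix $X^T$:

```python
import numpy as np

# (X X^T)^{-1} X y  vs  least-squares fit of y on the design matrix A = X^T
rng = np.random.default_rng(2)
d, n = 4, 50
X = rng.normal(size=(d, n))
y = rng.choice([-1.0, 1.0], size=n)

w_closed = np.linalg.solve(X @ X.T, X @ y)
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)
print(np.allclose(w_closed, w_lstsq))   # True
```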