General formulation of machine learning optimisation problem


I'm having some trouble understanding the first equation from this paper: https://arxiv.org/abs/1805.09545.

The equation in question (the paper's first display) is, as I read it,
$$\min_{\mu}\; J(\mu), \qquad J(\mu) = R\!\left( \int_\Theta \phi(\theta)\, d\mu(\theta) \right) + G(\mu),$$
where $\mu$ ranges over signed measures on $\Theta$.

I interpret this as searching for a minimiser $\phi^*$ in a Hilbert space, which must be a linear combination of "a few" elements from a parameterised set $\{\phi(\theta):\theta\in \Theta\}$. "Linear combination" is then replaced by "integral over a signed measure". Ignoring the regularisation term $G$, we have $$J(\mu) = R\left( \int \phi(\theta)\,d\mu(\theta) \right),$$ where the integral "mixes" all the $\phi(\theta)$ in the parameterised set, giving a candidate minimiser $\phi':=\int \phi\, d\mu$, and we may write $R(\phi')$, or just $R(\mu)$, instead of the above display.
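To make this interpretation concrete, here is a small numerical sketch (my own toy construction, not from the paper): for a discrete signed measure $\mu = \sum_i w_i \delta_{\theta_i}$, the integral collapses to a weighted sum, so $J(\mu) = R\big(\sum_i w_i \phi(\theta_i)\big)$. The particular choices of $\phi$ (a ReLU feature), $R$ (squared loss against targets), and the toy data are all my own assumptions.

```python
import numpy as np

def phi(theta, x):
    """One parameterised feature: a ReLU unit with theta = (a, b). (My choice.)"""
    a, b = theta
    return np.maximum(a * x + b, 0.0)

def J(weights, thetas, x, y):
    """J(mu) = R(int phi dmu) for a discrete signed measure mu = sum_i w_i delta_{theta_i},
    with R taken to be the mean squared error against targets y."""
    phi_mix = sum(w * phi(th, x) for w, th in zip(weights, thetas))
    return np.mean((phi_mix - y) ** 2)

# Toy data: fit y = |x| on a grid.
x = np.linspace(-1.0, 1.0, 50)
y = np.abs(x)

# A two-atom signed measure (note the negative weight is allowed).
weights = [1.0, -0.5]
thetas = [(1.0, 0.0), (-1.0, 0.0)]
print(J(weights, thetas, x, y))

# The measure with weights (1, 1) mixes max(x,0) + max(-x,0) = |x|,
# so it attains J = 0 on this toy problem.
print(J([1.0, 1.0], thetas, x, y))  # prints 0.0
```

The point of the sketch is only that $\mu$'s weights (which may be negative) decide how the elements of $\{\phi(\theta)\}$ are combined, and $R$ then scores the resulting mixture.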

  1. Is this interpretation correct?
  2. Why do we pass to signed measures? Does it give the optimisation problem theoretical guarantees it wouldn't have if we restricted to linear combinations?
  3. I don't see how training data is incorporated here - would it be included in the loss function $R$? E.g. say we are doing OLS ($Y=X\beta$), and write $\beta$ instead of $\int \phi d\mu$. Would we then have something like $$R(\beta) = R_{X,Y}(\beta) = \|Y-X\beta\|^2?$$
  4. Why is the minimiser $\phi^*$ a linear combination of elements in $\{\phi(\theta):\theta\in \Theta\}$, rather than the parameterised set itself including all such linear combinations? E.g. in the OLS case, is it more correct to let $\{\phi(\theta):\theta\in \Theta\}$ be the standard basis vectors of $\mathbb{R}^p$ (the space containing $\beta$), with $\mu$ deciding the appropriate linear combination of these basis vectors? And is this essentially why the regularisation is applied to $\mu$, not to $\int \phi\, d\mu$?
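To spell out the OLS reading of questions 3 and 4, here is a sketch of my own construction (not from the paper): take $\Theta = \{0,\dots,p-1\}$, let $\phi(j) = e_j$ be the $j$-th standard basis vector, and let $\mu = \sum_j \beta_j \delta_j$, so that $\int \phi\, d\mu = \beta$ and the data enter only through $R_{X,Y}(\beta) = \|Y - X\beta\|^2$. The dimensions and the noiseless data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true  # noiseless toy data, so the true beta attains zero loss

def phi(j, p=p):
    """phi(theta) for theta = j: the j-th standard basis vector of R^p."""
    e = np.zeros(p)
    e[j] = 1.0
    return e

def R(beta, X=X, Y=Y):
    """The data-dependent loss R_{X,Y}(beta) = ||Y - X beta||^2."""
    return np.linalg.norm(Y - X @ beta) ** 2

# The measure's weights ARE the regression coefficients:
# beta = int phi dmu = sum_j beta_j * phi(j).
beta = sum(b * phi(j) for j, b in enumerate(beta_true))
print(R(beta))  # prints 0.0: the mixture recovers beta_true exactly
```

On this reading, regularising $\mu$ rather than $\int \phi\, d\mu$ amounts to penalising the coefficients $\beta_j$ directly (e.g. the total-variation norm of $\mu$ here is $\|\beta\|_1$), which seems consistent with how $G$ is used in the paper.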