I'm having some trouble understanding the first equation from this paper: https://arxiv.org/abs/1805.09545.
I interpret this as follows: we search for a minimiser $\phi^*$ in a Hilbert space that is constrained to be a linear combination of "a few" elements of a parameterised set $\{\phi(\theta):\theta\in \Theta\}$, and we then replace "linear combination" by "integral against a signed measure". Ignoring the regularisation term $G$, this gives $$J(\mu) = R\left( \int \phi(\theta)\,d\mu(\theta) \right),$$ where the integral "mixes" the $\phi(\theta)$ in the parameterised set into a candidate minimiser $\phi':=\int \phi\, d\mu$, so we may write $R(\phi')$, or (abusing notation) $R(\mu)$, for the display above.
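To make this concrete for myself (this example is mine, not from the paper): taking $\mu$ to be an atomic signed measure seems to recover the finite linear combination, $$\mu = \sum_{i=1}^m c_i\,\delta_{\theta_i} \quad\Longrightarrow\quad \int \phi(\theta)\,d\mu(\theta) = \sum_{i=1}^m c_i\,\phi(\theta_i), \qquad c_i \in \mathbb{R},$$ so the measure formulation would strictly generalise the "few elements" picture above.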
- Is this interpretation correct?
- Why do we pass to signed measures? Does it give the optimisation problem theoretical guarantees it wouldn't have if we restricted to linear combinations?
- I don't see how the training data enters here. Would it be baked into the loss function $R$? E.g. say we are doing OLS ($Y=X\beta$) and write $\beta$ in place of $\int \phi\, d\mu$. Would we then have something like $$R(\beta) = R_{X,Y}(\beta) = \|Y-X\beta\|^2?$$
- Why is $\phi^*$ a linear combination of elements of $\{\phi(\theta):\theta\in \Theta\}$, rather than the parameterised set itself including all such linear combinations? E.g. in the OLS case, is it more correct to let $\{\phi(\theta):\theta\in \Theta\}$ be the standard basis vectors of $\mathbb{R}^p\ni \beta$, and to let $\mu$ decide the appropriate linear combination of these basis vectors? And is this essentially why the regularisation is applied to $\mu$ rather than to $\int \phi\, d\mu$?
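To check that the OLS reading in the last two bullets at least hangs together, here is a tiny numerical sketch (the setup and all names are mine, not from the paper): the parameter set indexes the standard basis vectors, and an atomic signed measure is just a weight vector, which should reproduce the OLS coefficients.

```python
import numpy as np

# My OLS reading of the measure formulation: Theta = {0,...,p-1}
# indexes the standard basis vectors, phi(j) = e_j, and an atomic
# signed measure mu = sum_j c_j * delta_j corresponds to the weight
# vector c, so that int phi dmu = sum_j c_j e_j = c = beta.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true                      # noiseless data, for simplicity

def R(beta):
    """Data-dependent loss R_{X,Y}(beta) = ||Y - X beta||^2."""
    return float(np.sum((Y - X @ beta) ** 2))

# OLS gives the optimal weights c_j (the "signed measure" on Theta).
c, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Reassemble beta as the integral of phi against mu.
basis = np.eye(p)                      # phi(j) = e_j
beta_from_mu = sum(c[j] * basis[j] for j in range(p))

print(R(beta_from_mu))                 # ~0, since the data are noiseless
```

If this is the right picture, then $R$ carries all the data dependence, and the optimisation over $\mu$ reduces to optimisation over the weight vector $c$ in this finite-dimensional case.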
