I was reading the following paper on dimensionality reduction (1), and also one on a theory of networks for approximation and learning (2), and was trying to understand how the regularization problem leads to the form of the predictor function $f$.
In other words, I was trying to fully understand the details of why, if one tries to minimize the functional
$$ H[f] = \sum^{N}_{i=1} (y_i - f(x_i))^2 + \lambda \| Pf \|^2 , $$
the solution of the variational problem has the following simple form:
$$ f(x) = \sum^N_{i=1} c_i G(x ; x_i) + p(x)$$
where $G(x; x_i)$ is the Green's function of the self-adjoint differential operator $\hat{P}P$ (with $\hat{P}$ the adjoint of $P$), $p(x)$ is a linear combination of functions that span the null space of $P$, and the coefficients $c_i$ satisfy a linear system of equations determined by the $N$ "examples", i.e. the data to be approximated.
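For context, here is how far I got on my own (I'm not sure this is rigorous). Setting the functional derivative of $H[f]$ to zero, in the distributional sense, seems to give

$$ \hat{P}P\, f(x) = \frac{1}{\lambda} \sum^{N}_{i=1} \big( y_i - f(x_i) \big)\, \delta(x - x_i). $$

Then, if $G$ satisfies $\hat{P}P\, G(x; x_i) = \delta(x - x_i)$, superposition would give

$$ f(x) = \frac{1}{\lambda} \sum^{N}_{i=1} \big( y_i - f(x_i) \big)\, G(x; x_i) + p(x), $$

which matches the stated form with $c_i = (y_i - f(x_i))/\lambda$, and $p(x)$ can be added freely because $\hat{P}P\, p = 0$. But I don't fully understand the steps justifying this.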
Why is it that the solution has to be a linear combination of the Green's function evaluated at the data points? Why does the Green's function matter in this case?
I think they try to explain it in the second paper (2), but I didn't really understand the details. If someone understands them better, I would be extremely grateful for an explanation.
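To at least convince myself numerically that the claimed form works, I put together this small sketch. It assumes a Gaussian kernel as the Green's function (which, as far as I understand, corresponds to one particular choice of stabilizer $P$; other choices would give other kernels), and it solves the linear system for the coefficients $c_i$ that I believe follows from $c_i = (y_i - f(x_i))/\lambda$:

```python
import numpy as np

# Toy 1-D data: N "examples" (x_i, y_i) to be approximated.
rng = np.random.default_rng(0)
N = 20
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

# Assumed Green's function: a Gaussian kernel G(x; x_i).
# (This is my assumption for a specific P; a different P would
# give a different kernel, e.g. a thin-plate spline.)
def G(a, b, sigma=0.2):
    return np.exp(-((a - b) ** 2) / (2 * sigma ** 2))

lam = 1e-3  # regularization parameter lambda

# Rewriting c_i = (y_i - f(x_i)) / lambda with f(x_j) = sum_i c_i G(x_j; x_i)
# gives the linear system (G + lambda * I) c = y for the coefficients.
Gmat = G(x[:, None], x[None, :])
c = np.linalg.solve(Gmat + lam * np.eye(N), y)

# The regularized predictor f(x) = sum_i c_i G(x; x_i).
# (Null-space term p(x) omitted; for the Gaussian case I believe it is empty.)
def f(xq):
    return G(np.asarray(xq)[:, None], x[None, :]) @ c

print(np.max(np.abs(f(x) - y)))  # residual at the data points, shrinks with lambda
```

The fit does pass near the data points for small $\lambda$, which at least matches the claimed form, but the numerical check obviously doesn't tell me *why* the minimizer must take this shape.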
To understand this I was going through the following videos:
https://www.youtube.com/watch?v=4U3P0LcaJcw&index=27&list=PL4C6F6B595A5852E8
https://www.youtube.com/watch?v=6n0uINcvx_E&index=28&list=PL4C6F6B595A5852E8
https://www.youtube.com/watch?v=cE4ZWo3pcCk&index=29&list=PL4C6F6B595A5852E8
I think they explain it there too, but I was having some trouble understanding everything. I am still working through these videos and will add further details as they come up in the derivation.