Apologies if I get some of the wording wrong, I am still trying to get my head around this.
I have noticed that Bayes' Theorem can be used to optimize a function / find out more about it - such as what these people are doing: https://github.com/fmfn/BayesianOptimization - but it is really unclear to me how it works.
I do not think I have fully understood how to go from Bayes' Theorem to Bayesian Optimization of functions. I was hoping to find a simple tutorial online about how Bayesian Optimization for the equation of a line would work - as I thought starting with a line should be simple.
Could someone maybe explain to me if that is even possible / point me to the right place?
Thank you.
The simplest place to start is Bayesian estimation of a linear regression - in this setting, usually called MAP (maximum a posteriori) estimation.
First, we need to formulate the problem probabilistically (Bayes' theorem is about random variables, after all). Let $\varepsilon\sim\mathcal{N}(0,\sigma_\varepsilon^2)$ be a random variable representing the noise, and suppose we observe a set of pairs $D=\{(x_i,y_i)\}$. Our modelling assumption is that $x$ and $y$ are linearly related, up to noise: $$ y = w^Tx+\varepsilon $$ Treating $w$ as a fixed (deterministic) parameter, this gives $ y\sim\mathcal{N}(w^Tx,\sigma_\varepsilon^2) $. Thus, if we knew, or at least had a guess for, $w$, we could compute the density of a particular $y$ given $x$. Written as a conditional density: $$ P(y|w,x)=\frac{1}{\sigma_\varepsilon\sqrt{2\pi}}\exp\left( \frac{-[y-w^Tx]^2}{2\sigma_\varepsilon^2} \right) $$

Our task is to use $D$ to estimate $w$. A sensible thing to do is to pick the $w$ that maximizes the posterior $P(w|D)$. Let's use Bayes' rule now: $$ P(w|D) = \frac{P(D|w)P(w)}{P(D)} $$ Our data $D$ is fixed, so we can ignore $P(D)$ in the optimization. Next, assuming the data are IID, the likelihood factorizes as $P(D|w)=\prod_j P(y_j|x_j,w)$. The remaining piece is $P(w)$. This is handled in Bayesian fashion, by assuming a prior distribution: $$ w\sim\mathcal{N}(0,\eta I)$$ Why did we pick this as the prior? It encodes a preference for small weights, which acts as a form of regularization to prevent overfitting, at the cost of some possible bias.
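To make the model concrete, here is a small sketch in Python/NumPy. The true weights, noise level, and dataset size are arbitrary choices for illustration, not anything implied by the derivation; the function just evaluates $\sum_j \ln P(y_j|w,x_j)$ under the Gaussian noise model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth for illustration only.
w_true = np.array([2.0, -1.0])
sigma_eps = 0.5

# Draw a dataset D = {(x_i, y_i)} from the assumed model y = w^T x + eps.
X = rng.normal(size=(100, 2))
y = X @ w_true + rng.normal(0.0, sigma_eps, size=100)

def log_likelihood(w, X, y, sigma_eps):
    """Sum over j of ln P(y_j | w, x_j) under the Gaussian noise model."""
    resid = y - X @ w
    return np.sum(-0.5 * np.log(2 * np.pi * sigma_eps**2)
                  - resid**2 / (2 * sigma_eps**2))
```

As a sanity check, the weights that generated the data should score (much) better than a perturbed guess, e.g. `log_likelihood(w_true, X, y, sigma_eps)` versus `log_likelihood(w_true + 1.0, X, y, sigma_eps)`.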
Informally, the Bayesian interpretation is that we believe the weights follow this prior distribution before seeing any data, and that once the data are observed, we update our beliefs accordingly.
Ok, so now we want the most likely $w$ given $D$. Taking the logarithm for numerical (and perhaps aesthetic) reasons, and dropping terms that do not depend on $w$, we get: \begin{align} \hat{w} &= \arg\max_w \ln P(w|D) \\ &= \arg\max_w \ln\left( \frac{P(D|w)P(w)} {P(D)} \right) \\ &= \arg\max_w \ln P(D|w) + \ln P(w) \\ &= \arg\max_w \ln\left[\prod_j P(y_j|w,x_j)\right] + \ln\left( \frac{1}{(2\pi\eta)^{d/2}}\exp\left[\frac{-w\cdot w}{2\eta}\right] \right) \\ &= \arg\max_w \sum_j \ln \frac{1}{\sigma_\varepsilon\sqrt{2\pi}}\exp\left( \frac{-[y_j-w^Tx_j]^2}{2\sigma_\varepsilon^2} \right) -\frac{d}{2}\ln(2\pi\eta) - \frac{w\cdot w}{2\eta} \\ &= \arg\max_w \sum_j \left[ \ln \left(\frac{1}{\sigma_\varepsilon\sqrt{2\pi}}\right) + \frac{-[y_j-w^Tx_j]^2}{2\sigma_\varepsilon^2} \right] - \frac{w\cdot w}{2\eta} \\ &= \arg\max_w \frac{1}{2\sigma_\varepsilon^{2}} \sum_j {-[y_j-w^Tx_j]^2} - \frac{||w||_2^2}{2\eta} \\ &= \arg\min_w \frac{1}{2\sigma_\varepsilon^{2}} \sum_j {[y_j-w^Tx_j]^2} + \frac{||w||_2^2}{2\eta} \end{align}
You might recognize this: it is regularized least squares (ridge regression) for linear regression! The essence of the work is now done; how this objective is actually minimized is a separate matter (at least, as far as Bayes is concerned).
So, in fact, using some basic assumptions about our model and a prior distribution on the weights, we can use Bayes' theorem to derive an objective function, which we then optimize for our task. For more on MAP estimation for linear regression, see e.g. here.
I will just note in passing that this is a little different from the Bayesian optimization methodology that you linked. We just showed how to choose the parameters of our learning model using Bayes' theorem, but often one wants to choose the hyper-parameters via Bayesian optimization instead. Here, that might be $\eta$ for instance.
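To make that distinction concrete, here is a sketch of choosing $\eta$ by scoring candidates on held-out data. A simple grid search stands in for the Bayesian optimization loop of the linked library (which would propose candidate $\eta$ values more cleverly); the data, grid, and split are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data; in practice these would come from your problem.
w_true = np.array([2.0, -1.0])
sigma_eps = 0.5
X = rng.normal(size=(120, 2))
y = X @ w_true + rng.normal(0.0, sigma_eps, size=120)
X_tr, y_tr, X_va, y_va = X[:80], y[:80], X[80:], y[80:]

def map_fit(X, y, eta, sigma_eps):
    """Closed-form MAP estimate for a given prior variance eta."""
    lam = sigma_eps**2 / eta
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Each candidate eta is "queried" and scored on held-out squared error;
# a Bayesian optimizer would choose which eta to query next instead of a grid.
etas = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
best_eta, best_err = None, np.inf
for eta in etas:
    w = map_fit(X_tr, y_tr, eta, sigma_eps)
    err = np.mean((y_va - X_va @ w)**2)
    if err < best_err:
        best_eta, best_err = eta, err
print(best_eta, best_err)
```

The inner fit uses Bayes' theorem (MAP estimation of $w$); the outer loop is the hyper-parameter search that tools like the linked repository automate.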