I am taking an online class on Machine Learning and I'm trying to fully understand how the cost function works. Can someone explain to me exactly what is going on in the function below?
Cost function $$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 $$
Hypothesis function $$h_\theta(x) = \theta_0 + \theta_1 x$$
The parameters are: $\;\theta_0$ and $\theta_1$.
I want to be able to understand what is happening in the cost function. The part I don't really understand is the summation ($\sum$); I don't know what it means.
Can someone help me?
Thank you.
Firstly, this is a Sigma $\sum$ denoting a sum and not an epsilon ($\epsilon$). The two symbols are used in two different contexts in maths, and the $\epsilon-\delta$ tag isn't applicable to your question.
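To make the notation concrete, here is a tiny Python sketch (the numbers are made up purely for illustration) showing that $\sum_{i=1}^m a_i$ just means "add up the terms $a_1 + a_2 + \dots + a_m$":

```python
# The summation sign simply means "add up the terms for i = 1, ..., m".
# Example with m = 4 made-up terms a_1, ..., a_4:
a = [3, 1, 4, 1]   # a_1, a_2, a_3, a_4
total = sum(a)     # the same as a[0] + a[1] + a[2] + a[3]
print(total)       # 9
```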
In simple terms, we are given the set of data points $D := \{ (x_i, y_i) : 1 \leq i \leq n \}$. Our job is to find the best-fit line to the data. If you have ever plotted a scatter diagram of the data and told yourself "the points roughly fall in a straight line", that's exactly what we are doing here.
The hypothesis function $h_\theta(x) := \theta_0 + \theta_1 x$ defines a straight line. There are infinitely many straight lines we can draw on the plane by varying $\theta_0$ and $\theta_1$. But we want to draw only the straight line that best fits the data, similar to what you'd draw after looking at the scatter plot.
The $J$ function serves this purpose algorithmically. If the parameters $\theta_0, \theta_1$ give a line far away from the data, $J$ will return a very large cost. Conversely, a good hypothesis has a very small cost. You typically find the best-fit line by minimizing the cost function with respect to the $\theta$'s.
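Here is a minimal sketch of the cost function in Python, using made-up toy data (not from the course) that happens to lie exactly on the line $y = 1 + 2x$, so you can see the cost drop to zero for a perfect fit:

```python
def h(theta0, theta1, x):
    """Hypothesis: the straight line theta0 + theta1 * x."""
    return theta0 + theta1 * x

def J(theta0, theta1, xs, ys):
    """Cost: (1 / (2m)) * sum of squared residuals over all m data points."""
    m = len(xs)
    return sum((h(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Toy data lying exactly on the line y = 1 + 2x:
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

print(J(1.0, 2.0, xs, ys))  # 0.0 -- a perfect fit has zero cost
print(J(0.0, 0.0, xs, ys))  # 10.5 -- a bad fit has a much larger cost
```

The `sum(... for x, y in zip(xs, ys))` expression is exactly the $\sum_{i=1}^m$ in the formula: it visits each data point $(x_i, y_i)$ once and adds up the squared errors.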