I'm new to machine learning and I have questions about two things in linear regression. I'll explain what I understand, because my confusion may start there.
$y = \theta_0 + \theta_1x$ is our model. We randomly initialize the values of $\theta_0$ and $\theta_1$. Then $$ J(\theta) = \frac{1}{2m}\sum_i [h_\theta(x^{(i)}) - y^{(i)}]^2 $$ This cost function measures the error between the real data and our prediction, where the prediction is $h_\theta(x) = \theta_0 + \theta_1x$.
Here the problem begins for me.
1. We have to minimize the cost function; I'm OK with that. But if, for example, we can find the minimum with respect to $\theta_1$, what is the role of gradient descent? I know gradient descent is used to find the minimizing value of $\theta_1$, but I'm not that familiar with it yet.
2. $$ \frac{\partial}{\partial m} J = \frac{2}{N}\sum\limits_{i=1}^N -x_i(y_i - [mx_i + b]) $$ $$ \frac{\partial}{\partial b} J = \frac{2}{N}\sum\limits_{i=1}^N -(y_i - [mx_i + b]) $$
These are the derivatives of the cost function. Gradient descent requires iteration, but as far as I know, if we set the derivatives equal to zero we directly get the best $\theta_1$. Yet we use gradient descent anyway. I'm confused about that.
Thanks for your answers.
Gradient descent is not for minimizing the value of $\theta_i$; it is for minimizing $J$. The minimum of $J$ sits at the values $\theta=(\theta_0,\theta_1)$ where $\nabla J$ vanishes. Gradient descent iteratively updates $\theta$ for $t=1,\ldots,N_{max}$ via $$ \theta_t = \theta_{t-1} - \eta \nabla J(\theta_{t-1}) $$ or until $\theta$ converges to an optimum, where $\eta$ is the step size (how fast you want to move). Too fast and you may overshoot the optimum and oscillate; too slow and you will take forever to converge.
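Here is a minimal sketch of that update rule in NumPy, on hypothetical toy data generated from $y = 2 + 3x$ plus a little noise (the data, learning rate, and iteration count are all illustrative choices, not part of any standard recipe):

```python
import numpy as np

# Hypothetical toy data from y = 2 + 3x with small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2 + 3 * x + rng.normal(0, 0.01, 100)

theta0, theta1 = 0.0, 0.0   # initial guesses
eta = 0.1                   # step size (illustrative)
m = len(x)

for _ in range(5000):
    pred = theta0 + theta1 * x
    # Gradients of J = (1/2m) * sum((pred - y)^2) w.r.t. theta0, theta1
    g0 = (1 / m) * np.sum(pred - y)
    g1 = (1 / m) * np.sum((pred - y) * x)
    theta0 -= eta * g0
    theta1 -= eta * g1

print(theta0, theta1)  # should end up near 2 and 3
```

Each iteration takes one step opposite the gradient; with a well-chosen $\eta$, the parameters settle near the true values.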
Indeed, gradient descent is not strictly necessary for linear regression. One can always solve directly for the (OLS) optimum $$ \theta = (X^TX)^{-1}X^TY $$ where $X$ is the data matrix. For other, more complex learners, however, it is necessary to use gradient descent (or other iterative techniques, like stochastic gradient descent, for instance).
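The closed-form solution above can be sketched like this (reusing the same kind of hypothetical toy data; in practice one solves the normal equations rather than inverting $X^TX$ explicitly):

```python
import numpy as np

# Hypothetical toy data from y = 2 + 3x with small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2 + 3 * x + rng.normal(0, 0.01, 100)

# Data matrix X: a column of ones (intercept) plus the feature column.
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) theta = X^T y, equivalent to theta = (X^T X)^{-1} X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [2, 3]
```

No iteration is needed: one linear solve recovers the same optimum that gradient descent approaches step by step.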
Nevertheless, notice that for large amounts of high-dimensional data, using the formula above is likely to be intractably expensive (and $X^TX$ may be ill-conditioned or singular, though one can use redundancy-reduction methods to remedy that). Thus, even for linear regression, gradient descent may be preferable. Further, the usual problem with gradient descent (getting stuck in local minima) cannot happen in this case, because the OLS cost is convex. So there is little reason to avoid it.
Note that there is yet another way to compute the regression via the QR decomposition, which can be useful in some situations.
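A sketch of the QR route, assuming the same hypothetical toy data: factor $X = QR$ with $Q$ orthonormal, then solve the triangular system $R\theta = Q^Ty$, which avoids forming the potentially ill-conditioned $X^TX$ at all:

```python
import numpy as np

# Hypothetical toy data from y = 2 + 3x with small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2 + 3 * x + rng.normal(0, 0.01, 100)

X = np.column_stack([np.ones_like(x), x])

# Reduced QR factorization: X = Q R, with R upper triangular.
Q, R = np.linalg.qr(X)

# Solve R theta = Q^T y (a small triangular system).
theta = np.linalg.solve(R, Q.T @ y)
print(theta)  # approximately [2, 3]
```

Because the condition number of $R$ equals that of $X$ (rather than its square, as with $X^TX$), this route is numerically gentler.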