The gradient descent algorithm is given as:
repeat {
$$\displaystyle \theta_j := \theta_j - \frac{1}{m} \alpha \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j $$
}
Given these values:
$$\begin{array}[t]{c|c} x & y\\ \hline 1 & 1\\ 2 & 2\\ 3 & 3\\ 4 & 4 \end{array}$$
$\alpha$ (learning rate) $= 1$
$m$ (number of training examples) $= 4$
$h_\theta(x) = \theta_0 + \theta_1x_1$
Setting $\theta_0$ to $0$ and $\theta_1$ to $1$
Is this a correct implementation of this algorithm?
$$\theta_1 := \theta_1 - (1) \frac{1}{4}\left(0 + ((1)(1) - 1)+ ((1.5)(2) - 2)+ ((2)(3) - 3)+ ((2.5)(4) - 4)\right)$$
This excludes the $x_j^{(i)}$ part, as I'm not sure what value it should take. $j$ is the feature index, and there is just one feature in this example: $x$. Therefore $x_j^{(i)}$ should be $1$, but this does not make sense, as it then only makes use of one feature parameter?
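To make my confusion concrete, here is how I would code a single update in Python (my own sketch; I'm assuming $x_j^{(i)} = 1$ when $j = 0$, the implicit column of ones, and $x_j^{(i)} = x^{(i)}$ when $j = 1$):

```python
# One simultaneous batch update for h_theta(x) = theta_0 + theta_1 * x
# on the data x = y = 1, 2, 3, 4, with alpha = 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 3.0, 4.0]
alpha, m = 1.0, len(xs)
theta0, theta1 = 0.0, 1.0

# x_j^(i) is 1 for j = 0 (the implicit column of ones) and x^(i) for j = 1.
grad0 = sum((theta0 + theta1 * x - y) * 1.0 for x, y in zip(xs, ys)) / m
grad1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m

# Update both parameters simultaneously, using the old values of theta.
theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
print(theta0, theta1)
```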
Update:
Take this extended example; is it still correct?
Updated the gradient descent algorithm to be more explicit, replacing $h_\theta(x^{(i)})$ with $h_\theta(x_j^{(i)})$:
repeat {
$$\displaystyle \theta_j := \theta_j - \frac{1}{m} \alpha \sum_{i=1}^m (h_\theta(x_j^{(i)}) - y^{(i)}) x^{(i)}_j $$
}
Data set with 2 examples, where each example has 3 features:
$$\begin{array}[t]{c|c|c|c} x_1 & x_2 & x_3 & y\\ \hline 1 & 2 & 1 & 1\\ 4 & 2 & 5 & 2 \end{array}$$
As $x$ now contains multiple features, the hypothesis function is now: $h_\theta(x) = \theta^TX$
Initialising the $\theta$ matrix: $$\begin{array}[t]{c|c|c|c} \theta_0 &\theta_1 & \theta_2 & \theta_3 \\ \hline 0 & 1 & 2 & 1\\ 0 & 4 & 2 & 5 \end{array}$$
In this example I have set the learning rate to $0.5$.
1st iteration:
$$ \theta_{0}:= -\frac 12 (.5) [( (1*4) + (4*1)+ (2*2) - 1)+ ((2*2)+ (5*1)+ (1*5) - 2)](1)= -5.75 $$ $$ \theta_{0}=-5.75 $$
How is $\theta_{1}$ calculated ?
Update 2:
Each theta value has been picked at random:
$ \theta=\begin{bmatrix} 1\\ 3\\ 2\\ 1 \end{bmatrix} $
Dataset including a column of ones:
$$\begin{array}[t]{c|c|c|c|c} x_0 & x_1 & x_2 & x_3 & y\\ \hline 1 & 1 & 2 & 1 & 1\\ 1 & 4 & 2 & 5 & 2 \end{array}$$
Again, the gradient descent algorithm, this time including parentheses around what is summed:
repeat {
$$\displaystyle \theta_j := \theta_j - \frac{1}{m} \alpha \bigg(\sum_{i=1}^m ((h_\theta(x_j^{(i)}) - y^{(i)}) x^{(i)}_j)\bigg) $$
}
$h_\theta(x) = \theta^TX$
After the first iteration there should be computed values for $\theta_{0}, \theta_{1}, \theta_{2}, \theta_{3}$?
Is this then correct?
$x_0^{(0)}$=1
$x_0^{(1)}$=1
$x_1^{(0)}$=1
$x_1^{(1)}$=4
$x_2^{(0)}$=2
$x_2^{(1)}$=2
$x_3^{(0)}$=1
$x_3^{(1)}$=5
$$ \theta_{0}:= -\frac 12 (.5) [( (1*1) + (3*1)+ (2*2) + (1*1)-1*1)+ ((1*1)+ (3*4)+ (2*2) + (1*5) - 2 * 1)] $$ $$ \theta_{1}:= -\frac 12 (.5) [( (1*1) + (3*1)+ (2*2) + (1*1)-1*1)+ ((1*1)+ (3*4)+ (2*2) + (1*5) - 2 * 4)] $$ $$ \theta_{2}:= -\frac 12 (.5) [( (1*1) + (3*1)+ (2*2) + (1*1)-1*2)+ ((1*1)+ (3*4)+ (2*2) + (1*5) - 2 * 2)] $$ $$ \theta_{3}:= -\frac 12 (.5) [( (1*1) + (3*1)+ (2*2) + (1*1)-1*1)+ ((1*1)+ (3*4)+ (2*2) + (1*5) - 2 * 5)] $$
So every theta calculation is the same, except that $x_j^{(i)}$ is updated.
Firstly, let's make the convention that $x = x_1$ and that $x_1^{(i)}$ expresses the value of $x_1$ at the $i$-th row (see example) or, equivalently, $x_1^{(i)}$ is the value of $x_1$ of the $i$-th training example. We notice that $x^{(1)}_1 = 1.$ We denote by $^{k}\theta_j$ the value of $\theta_j$ after $k$ updates (i.e. after $k$ repetitions of the algorithm). After one update we have: $$ \ ^{1}\theta_{0}:= \, ^{0}\theta_{0} - \frac 14 \bigg[\left(^{0}\theta_0+\,^{0}\theta_1x_1^{(1)}-y^{(1)}\right)\cdot x_0^{(1)} + \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(2)}-y^{(2)}\right)\cdot x_0^{(2)} + \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(3)}-y^{(3)}\right)\cdot x_0^{(3)} + \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(4)}-y^{(4)}\right)\cdot x_0^{(4)} \bigg]= 0, $$ since $^{0}\theta_0 = 0,$ $^{0}\theta_1 = 1$ and $x_0^{(i)} = 1.$ $$ \ ^{1}\theta_{1}:= \, ^{0}\theta_{1} - \frac 14 \bigg[\left(^{0}\theta_0+\,^{0}\theta_1x_1^{(1)}-y^{(1)}\right)\cdot x_1^{(1)} + \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(2)}-y^{(2)}\right)\cdot x_1^{(2)} + \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(3)}-y^{(3)}\right)\cdot x_1^{(3)} + \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(4)}-y^{(4)}\right)\cdot x_1^{(4)} \bigg]= 1. $$
Thus, $^{1}\theta_1 = 1$ and $^{1}\theta_0 = 0.$ Notice that if we want to proceed to the next iteration, we need both $\theta_0$ and $\theta_1$, which we found at the previous step. So: $$^{2}\theta_1 := \, ^{1}\theta_{1} - \frac 14 \bigg[\left(^{1}\theta_0+\,^{1}\theta_1x_1^{(1)}-y^{(1)}\right)\cdot x_1^{(1)} + \left(^{1}\theta_0+\,^{1}\theta_1x_1^{(2)}-y^{(2)}\right)\cdot x_1^{(2)} + \left(^{1}\theta_0+\,^{1}\theta_1x_1^{(3)}-y^{(3)}\right)\cdot x_1^{(3)} + \left(^{1}\theta_0+\,^{1}\theta_1x_1^{(4)}-y^{(4)}\right)\cdot x_1^{(4)} \bigg]= 1.$$
So, regardless how many updates we apply, the value of $\theta_1$ will be constantly equal to $1,$ since at every iteration we have $\theta_0 = 0$ and $\theta_1 = 1.$
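This conclusion can also be checked numerically; here is a small Python sketch (my own illustration, not part of the course material) that repeats the update several times and records the parameters:

```python
# Repeat the batch update for h(x) = theta0 + theta1 * x on the data
# x = y = 1, 2, 3, 4, starting from theta = (0, 1) with alpha = 1.
data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
alpha, m = 1.0, len(data)
theta0, theta1 = 0.0, 1.0

history = []
for k in range(5):  # five repetitions of the "repeat { ... }" loop
    grad0 = sum((theta0 + theta1 * x - y) for x, y in data) / m
    grad1 = sum((theta0 + theta1 * x - y) * x for x, y in data) / m
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    history.append((theta0, theta1))

# Every residual h(x) - y is already 0, so theta never moves.
print(history)  # [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
```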
About update 2: here is what I would do if I were in your shoes. First of all, I would calculate separately $h_\theta(x^{(1)})$ and $h_\theta(x^{(2)}),$ where $\theta$ is our initial vector $\theta=[ 1 \quad 3 \quad 2 \quad 1]^T.$
We have:
$$h_\theta(x^{(1)})= 1\cdot 1 + 3\cdot 1 +2\cdot 2 + 1\cdot 1 = 9 $$
$$h_\theta(x^{(2)})= 1\cdot 1 + 3\cdot 4 + 2\cdot 2 + 1\cdot 5 = 22. $$
Thus, if we implement the algorithm, we get: $$\begin{array}[t]{l} \theta_0 = 1-\frac{0.5}{2}\cdot \left[(9-1)\cdot 1 +(22-2)\cdot 1 \right]=-6\\ \theta_1 = 3 - \frac{0.5}{2}\cdot\left[(9-1)\cdot 1 + (22-2)\cdot 4\right]=-19\\ \theta_2 = 2 -\frac{0.5}{2}\cdot \left[(9-1)\cdot 2+(22-2)\cdot 2\right]=-12\\ \theta_3 = 1-\frac{0.5}{2}\cdot \left[(9-1)\cdot 1+(22-2)\cdot 5\right]=-26 \end{array} $$
Thus, after one update the new $\theta=[-6 \quad -19 \quad -12 \quad -26]^T.$
If you want to apply the algorithm once again, evaluate the new $h_\theta(x^{(1)})$ and $h_\theta(x^{(2)})$ and proceed as before.
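If it helps, the computation above can be reproduced with a short Python sketch (my own, with the column of ones $x_0 = 1$ included in each training example):

```python
# One batch gradient descent update for h_theta(x) = theta^T x,
# with alpha = 0.5 and m = 2 training examples.
alpha = 0.5
X = [[1.0, 1.0, 2.0, 1.0],   # x^(1), including x_0 = 1
     [1.0, 4.0, 2.0, 5.0]]   # x^(2)
y = [1.0, 2.0]
theta = [1.0, 3.0, 2.0, 1.0]
m = len(X)

def h(theta, x):
    """Hypothesis: the inner product theta^T x."""
    return sum(t * xj for t, xj in zip(theta, x))

# h(x^(1)) = 9 and h(x^(2)) = 22, as computed above.
residuals = [h(theta, x) - yi for x, yi in zip(X, y)]

# Simultaneous update of every theta_j using the same residuals.
theta = [theta[j] - alpha / m * sum(residuals[i] * X[i][j] for i in range(m))
         for j in range(len(theta))]

print(theta)  # [-6.0, -19.0, -12.0, -26.0]
```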
Notice that the prof uses the convention: $$\begin{array}[t]{c | c | c} x_0 & x_1 & x_2 & x_3\\ \hline x_0^{(1)} = 1 & x_1^{(1)} = 1 &x_2^{(1)} = 2 & x_3^{(1)} = 1\\ x_0^{(2)} = 1 & x_1^{(2)}=4 & x_2^{(2)} = 2 & x_3^{(2)} =5 \end{array} $$