Total Training Loss


How would I go about calculating the total training loss, and the optimal parameter values of $w$ (given as an optimal weight vector)?

Given an expression such as:

$$\sum_{n=1}^N (w^Tx_n-t_n)^2$$

How would I approach such a problem? When dealing with total training loss, I usually see expressions of the form $f(x;w) = w_0+w_1x$ (using linear regression), with the optimal model computed from: $$\frac{1}{N} \sum_{n=1}^N ((w_0 + w_1x_n) - t_n)^2$$

But I cannot fit the expression I provided above into this form. Or should I consider the $w^Tx_n$ term as $w_0x_0 + w_1x_1$? It confuses me a lot.

I would "normally" solve such a problem by finding the partial derivatives with respect to each component of $w$ (given the optimal model expression from above), setting them equal to 0 to find the optimal values, and then applying the second derivative test to confirm a minimum.

Can anyone explain what a similar approach would look like for an expression like the one above?

Best answer:

"Or should I consider the $w^Tx_n$ term as $w_0x_0 + w_1x_1$?"

Almost. $w$ is a vector and $w^T$ is its transpose. What you want is to find the values of $w$ that minimize the following function:

$$\sum_{n=1}^N (w^Tx_n+w_0-t_n)^2,$$

where each $x_n$ is also a vector and $w_0$ is a scalar. This can be written as

$$\sum_{n=1}^N \text{error}_n^2,$$

where $\text{error}_n = w^Tx_n+w_0-t_n$; we can call this the cost function.
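To make the quantities concrete, here is a minimal NumPy sketch of computing this total training loss for a candidate $w$ and $w_0$ (all data values below are made up):

```python
import numpy as np

# Toy data: N = 4 samples, 2 features each (made-up values).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [0.5, 1.5]])
t = np.array([3.0, 2.5, 4.0, 2.0])

w = np.array([1.0, 0.5])   # candidate weight vector
w0 = 0.0                   # bias term

# error_n = w^T x_n + w_0 - t_n for every sample at once
errors = X @ w + w0 - t

# Total training loss = sum of squared errors
total_loss = np.sum(errors ** 2)
print(total_loss)
```

Each row of `X` plays the role of one $x_n$, so `X @ w` computes all the inner products $w^Tx_n$ in one step.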

Via the chain rule, you can take the derivative of this cost function with respect to $w$:

$$\frac{\partial\, \text{cost}}{\partial w} = \frac{\partial\, \text{cost}}{\partial\, \text{error}_n} \cdot \frac{\partial\, \text{error}_n}{\partial w}$$

, which gives us, per sample,

$$2 \cdot \text{error}_n \cdot x_n$$

for $w$, and for $w_0$:

$$2 \cdot \text{error}_n$$

(the full gradient sums these over $n$).
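As a sanity check, these analytic derivatives can be compared against finite differences for a single sample; a small sketch with NumPy (all values made up):

```python
import numpy as np

# One made-up sample and candidate parameters.
x = np.array([1.5, -0.5])
t = 2.0
w = np.array([0.8, 0.3])
w0 = 0.1

def loss(w, w0):
    error = w @ x + w0 - t
    return error ** 2

error = w @ x + w0 - t
analytic_dw = 2 * error * x      # derivative of error^2 w.r.t. w
analytic_dw0 = 2 * error         # derivative of error^2 w.r.t. w_0

# Central finite differences for comparison.
eps = 1e-6
numeric_dw = np.array([
    (loss(w + eps * np.eye(2)[i], w0) - loss(w - eps * np.eye(2)[i], w0)) / (2 * eps)
    for i in range(2)
])
numeric_dw0 = (loss(w, w0 + eps) - loss(w, w0 - eps)) / (2 * eps)

print(analytic_dw, numeric_dw)
print(analytic_dw0, numeric_dw0)
```

The two pairs of values should agree to several decimal places, confirming the chain-rule result.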

Now, you can solve this via what we call the 'normal equations', but a popular alternative is gradient descent, where the vector $w$ moves iteratively, scaled by a learning rate, in the direction of the negative gradient. So in our case, we iteratively update $w$ and $w_0$ via:

$$w \leftarrow w - \text{learning rate} \cdot \text{error}_n \cdot x_n \quad \text{and} \quad w_0 \leftarrow w_0 - \text{learning rate} \cdot \text{error}_n$$

(the factor of 2 can be absorbed into the learning rate, and the minus sign is needed because $\text{error}_n$ is prediction minus target).
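A minimal per-sample gradient descent loop along these lines (the data, learning rate, and iteration count are made up; the update subtracts along the gradient since the error is defined as prediction minus target):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up noiseless linear data: t = 2*x1 - x2 + 0.5
X = rng.normal(size=(100, 2))
t = 2 * X[:, 0] - X[:, 1] + 0.5

w = np.zeros(2)   # weight vector, initialized at zero
w0 = 0.0          # bias
lr = 0.01         # learning rate (absorbs the factor of 2)

for epoch in range(50):
    for x_n, t_n in zip(X, t):
        error = w @ x_n + w0 - t_n   # prediction minus target
        w = w - lr * error * x_n     # step against the gradient
        w0 = w0 - lr * error

print(w, w0)  # should approach [2, -1] and 0.5
```

On this noiseless data the loop recovers the generating parameters; with noisy data it would converge to the least-squares optimum instead.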

So, what we do is: we make a prediction of the form $$w^Tx_n+w_0$$ and after every prediction, we compute the error and update our weight vector $w$ and bias $w_0$ as I showed, which will converge to the optimal values of $w$ and $w_0$. This is pre-programmed in many libraries and easy to use in e.g. Python.
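For the closed-form route via the normal equations, one such library routine is NumPy's least-squares solver; a sketch (same made-up data as a bias column of ones is appended so $w_0$ becomes just another weight):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up noiseless linear data: t = 2*x1 - x2 + 0.5
X = rng.normal(size=(100, 2))
t = 2 * X[:, 0] - X[:, 1] + 0.5

# Append a column of ones so the bias w_0 is absorbed into the weight vector.
X_aug = np.hstack([X, np.ones((len(t), 1))])

# Solve the least-squares problem min ||X_aug @ v - t||^2 directly.
w_full, *_ = np.linalg.lstsq(X_aug, t, rcond=None)

w, w0 = w_full[:2], w_full[2]
print(w, w0)  # recovers [2, -1] and 0.5 on this noiseless data
```

Unlike gradient descent, this gives the optimum in one step, at the cost of solving a linear system whose size grows with the number of features.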