Minimizing an error function by deriving a system of linear equations


Consider the following formula:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n,\mathbf{w})-t_n\}^2$$

where $\mathbf{w}$ is a vector of weights; $x_n$ and $t_n$ come from two vectors of length $N$; and $y$ is a polynomial:

$$y(x,\mathbf{w}) = \sum_{j=0}^M w_jx^j$$
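In code, the two definitions read as follows (a minimal Python sketch; the names `poly_y` and `error_E` are mine, not from the original text):

```python
# y(x, w) and E(w) exactly as defined above; w is the list
# [w_0, ..., w_M], xs and ts are the length-N data vectors.

def poly_y(x, w):
    """y(x, w) = sum_{j=0}^{M} w_j * x**j, where M = len(w) - 1."""
    return sum(w_j * x ** j for j, w_j in enumerate(w))

def error_E(w, xs, ts):
    """E(w) = (1/2) * sum_{n=1}^{N} (y(x_n, w) - t_n)**2."""
    return 0.5 * sum((poly_y(x_n, w) - t_n) ** 2
                     for x_n, t_n in zip(xs, ts))
```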

My task is to show a system of equations whose solution gives the weights $\mathbf{w} = \{w_i\}$ that minimize $E$. I reckoned I should differentiate and set the derivative to 0:

$$ \frac{dE}{dw} = \sum_{n=1}^{N}\{y(x_n,\mathbf{w})-t_n\}\times\frac{dy}{dw}$$

$$ \frac{dE}{dw} = \sum_{n=1}^{N}\{y(x_n,\mathbf{w})-t_n\}\times\sum_{j=0}^{M}x_n^j$$

$$\sum_{n=1}^{N}\{\sum_{j=0}^{M}w_jx_n^j-t_n\}\times\sum_{j=0}^{M}x_n^j = 0$$

The solution says to do what I did, except differentiate "with respect to $w_i$". It offers the following expression:

$$\sum_{n=1}^{N}(\sum_{j=0}^{M}w_jx_n^j-t_n) x_n^i = 0$$

I take it that each $i$ yields another equation, which is how this approach leads to a system of equations. There are two things I don't understand:

  1. Why is there not a summation over the $x_n^i$ values at the end? I thought differentiating $y$ would remove the weights but retain the summation.

  2. The inner summation uses $j$ while the outer uses $i$. Why are they not the same symbol? I know that if they were both $j$ we would be left with just one equation, but I don't understand how they can be different.

Best answer

You're minimizing a function $E(\mathbf{w})$, where $\mathbf{w}$ is a vector of size $M+1$ (one weight $w_j$ for each power $j = 0,\dots,M$). You can't just treat $\mathbf{w}$ as a single variable $w$ and differentiate with respect to it: it's a vector. So instead you need to take the partial derivative with respect to each $w_i$ and set it equal to 0 (or equivalently, you're looking for solutions of $\nabla E(\mathbf{w})=0$, where the gradient is with respect to $w_0,\dots,w_M$). For each $i$ with $0 \leq i \leq M$, you have

$$\frac{\partial y(x,\mathbf{w})}{\partial w_i}=x^{i},$$

so by the chain rule,

$$0=\frac{\partial E(\mathbf{w})}{\partial w_i}=\sum_{n=1}^N(y(x_n,\mathbf{w})-t_n)x_n^{i}.$$

Now plug in the definition of $y$.
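Concretely, substituting $y(x_n,\mathbf{w})=\sum_{j=0}^{M}w_jx_n^j$ gives, for each $i=0,\dots,M$, the linear equation $\sum_{j=0}^{M}\big(\sum_{n=1}^{N}x_n^{i+j}\big)w_j=\sum_{n=1}^{N}t_nx_n^{i}$; that is, one equation per $i$, hence an $(M+1)\times(M+1)$ linear system. As a sanity check, here is a minimal Python sketch that builds and solves exactly this system (the function names and the Gaussian-elimination helper are mine, not from the answer):

```python
# Build the normal equations A w = b with
#   A[i][j] = sum_n x_n^(i+j),  b[i] = sum_n t_n * x_n^i,
# one row per value of i, then solve for w = [w_0, ..., w_M].

def normal_equations(xs, ts, M):
    A = [[sum(x ** (i + j) for x in xs) for j in range(M + 1)]
         for i in range(M + 1)]
    b = [sum(t * x ** i for x, t in zip(xs, ts)) for i in range(M + 1)]
    return A, b

def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        w[r] = (aug[r][n] - sum(aug[r][c] * w[c]
                                for c in range(r + 1, n))) / aug[r][r]
    return w

def fit_poly(xs, ts, M):
    A, b = normal_equations(xs, ts, M)
    return solve_linear(A, b)
```

For data lying exactly on a degree-$M$ polynomial, the recovered weights should match it; e.g. points on $t = 1 + 2x$ with $M=1$ give $w \approx [1, 2]$.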