I am trying to understand why a specific series converges to the following result, in the context of an optimization problem.
We are given $N$ samples of the variable $y$. We want to find an $x$ that minimizes the mean square error. Thus, the recursive equation becomes:
$$ x_t = x_{t-1} - \lambda_t (x_{t-1} - y_{i_{t-1}}) $$
where $y_{i_{t-1}}$ is the sample of $y$, with index $i_{t-1}$, chosen at time instant $t-1$ during this optimization.
At any rate, I am told that if we let $\lambda_{t} = \frac{1}{t}$, then the final solution $\hat{x}$ converges to:
$$ \hat{x} = \frac{1}{N}\sum_{j=1}^{N}y_j $$
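I can verify this claim numerically. Below is a minimal sketch (the sample values, initial guess, and seed are my own arbitrary choices), assuming each sample is visited once, in order:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, size=50)  # N = 50 made-up samples of y

x = 0.0  # arbitrary initial guess
for t, y_t in enumerate(y, start=1):
    lam = 1.0 / t
    x = x - lam * (x - y_t)  # x_t = x_{t-1} - lambda_t (x_{t-1} - y_{i_{t-1}})

print(x - y.mean())  # zero up to floating-point error
```

Indeed the recursion lands exactly on the sample mean (note that at $t=1$, $\lambda_1 = 1$, so the initial guess is discarded entirely).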
What I cannot figure out is how we arrive at this conclusion. Why is it necessary that $\lambda_{t} = \frac{1}{t}$ for this solution to emerge?
Thanks.
There are a number of ways to show this; one is the following. Assume that $$y_n = x + \epsilon_n$$ where the $\epsilon_n\sim \mathcal{N}(0,1)$ are IID.
If you observe $N$ samples of the above, you can write the log-likelihood function you're trying to maximize (up to an additive constant) as:
$$ L(x^*) = -\frac{1}{2}\sum_{n=1}^N (y_n - x^*)^2$$
which implies that $$L'(x^*) = \sum_{n=1}^N (y_n - x^*) = 0 \Rightarrow \frac{1}{N}\sum_{n=1}^N y_n = x^*$$ and $$L''(x^*) = -\sum_{n=1}^N 1 = -N $$
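These two derivative identities can be checked with finite differences on simulated data (the seed, sample size, and step size $h$ below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
y = rng.normal(size=N)
x_star = y.mean()  # the MLE derived above

def L(x):
    # log-likelihood up to an additive constant
    return -0.5 * np.sum((y - x) ** 2)

h = 1e-4
# central first difference: the score vanishes at the MLE
L_prime = (L(x_star + h) - L(x_star - h)) / (2 * h)
# central second difference: the curvature is -N everywhere
L_second = (L(x_star + h) - 2 * L(x_star) + L(x_star - h)) / h**2

print(L_prime, L_second)  # approximately 0 and -N
```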
Now, expanding $L'$ to first order around $x_{N-1}$ and evaluating at $x_N$, where $x_N$ denotes the optimal $x$ at time $N$, we get:
$$L'(x_N) = L'(x_{N-1}) + L''(x_{N-1})(x_N - x_{N-1})$$
Since $x_N$ is optimal at time $N$, the LHS equals zero and we can re-arrange to show:
$$x_N = x_{N-1} - [L''(x_{N-1})]^{-1}L'(x_{N-1})$$
However, $x_{N-1}$ maximizes the log-likelihood of the first $N-1$ samples, so $\sum_{n=1}^{N-1} (y_n - x_{N-1}) = 0$, and hence $$L'(x_{N-1}) = \sum_{n=1}^N (y_n - x_{N-1}) = \sum_{n=1}^{N-1} (y_n - x_{N-1}) + (y_N - x_{N-1}) = y_N - x_{N-1}$$ while $$[L''(x_{N-1})]^{-1} = -\frac{1}{N}$$
we see that $$x_N = x_{N-1} + \frac{1}{N}(y_N - x_{N-1}) = x_{N-1} - \frac{1}{N}(x_{N-1} - y_N)$$ which is exactly your recursion with $\lambda_t = \frac{1}{t}$.
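As a sanity check, this update, written with the sign that matches your recursion, $x_N = x_{N-1} + \frac{1}{N}(y_N - x_{N-1})$, reproduces the ML estimate not just in the limit but exactly at every step (sample values and seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=30)  # made-up samples

x = y[0]  # x_1: with one sample, the ML estimate is that sample
for N in range(2, len(y) + 1):
    x = x + (1.0 / N) * (y[N - 1] - x)  # the recursive ML update
    assert np.isclose(x, y[:N].mean())  # x_N is the MLE given N samples
print("x_N equals the running sample mean for every N")
```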
All we've done is take the maximum-likelihood solution and rearrange it into recursive form. So $\lambda_t = t^{-1}$, in your notation, is an outcome of the MLE process: the factor $N = -L''(x_{N-1})$ is the (observed) Fisher information, which tells us how much more information we gain about the true value of $x^*$ when we observe a new data point.