Suppose we have a set of data $\{y_i, u_i\}_{i=1}^m$, where $y_i \in \mathbb{R}$ and $u_i \in \mathbb{R}^n$.
The claim is that
$$y_i = u_i^\top \theta + \varepsilon_i,$$ where $\theta \in \mathbb{R}^n$ is a vector of unknown parameters and $\varepsilon_i$ is some kind of zero-mean noise (e.g. Gaussian noise).
If I build the following:
$$Y = \left[\begin{array}{c}y_1\\y_2\\\vdots\\y_m\end{array}\right], U = \left[\begin{array}{c}u_1^\top\\u_2^\top\\\vdots\\u_m^\top\end{array}\right], $$
then I can find a least squares estimation of $\theta$:
$$\hat\theta = (U^\top U)^{-1}U^\top Y.$$
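For concreteness, a minimal sketch of this batch computation with NumPy (the data below are synthetic, just to fix the shapes):

```python
import numpy as np

# Synthetic data, only to fix the shapes: U is m x n, Y has length m.
m, n = 1000, 10
rng = np.random.default_rng(0)
U = rng.standard_normal((m, n))
theta_true = rng.standard_normal(n)
Y = U @ theta_true + 0.1 * rng.standard_normal(m)

# Batch least-squares estimate; np.linalg.lstsq is numerically safer than
# explicitly forming (U^T U)^{-1} U^T Y, but computes the same estimate.
theta_hat, *_ = np.linalg.lstsq(U, Y, rcond=None)
```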
Now suppose that $n$ and $m$ are very large. Then solving the problem above numerically becomes hard, mainly because of memory consumption.
My question is: is there an iterative method that lets me compute $\hat\theta$ incrementally, in order to save memory?
I hope that there exists some scheme like the following:
$$\begin{align} \hat\theta_1 & = f(y_1, u_1)\\ \hat\theta_2 & = g(\hat\theta_1, y_2, u_2) \\ \hat\theta_3 & = g(\hat\theta_2, y_3, u_3) \\ \vdots\\ \hat\theta_m & = g(\hat\theta_{m-1}, y_m, u_m) = \hat\theta = (U^\top U)^{-1}U^\top Y. \end{align}$$
Yes, you can apply gradient descent techniques to minimize the squared loss
$$L(\theta) = \sum_i (y_i - u_i^\top\theta)^2.$$
Compute the gradient
$$\nabla_\theta L(\theta) = -2\sum_i u_i\,(y_i - u_i^\top\theta) = -2\,U^\top (Y - U\theta)$$
and apply the gradient descent method of your choice.
The basic scheme is given by the recursion
$$\theta_{k+1} = \theta_{k} - \sigma\, \nabla_\theta L(\theta_k),$$
for some step size $\sigma$ that has to be tuned (with $k$ indexing iterations, since $n$ already denotes the dimension). However, plain gradient descent can be slow, and second-order methods should be preferred, for example Newton's method, or L-BFGS if you have a large number of variables and memory is a bottleneck, which seems to be the case here.
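As a sketch (assuming NumPy; the function name and the choice of setting the step size from the spectral norm of $U$ are just illustrative):

```python
import numpy as np

def gradient_descent(U, Y, sigma=None, iters=5000):
    """Minimize L(theta) = ||Y - U @ theta||^2 by plain gradient descent."""
    if sigma is None:
        # A safe constant step: 1/L, where L = 2 * ||U||_2^2 bounds the
        # Lipschitz constant of the gradient of the squared loss.
        sigma = 1.0 / (2.0 * np.linalg.norm(U, 2) ** 2)
    theta = np.zeros(U.shape[1])
    for _ in range(iters):
        grad = -2.0 * U.T @ (Y - U @ theta)  # gradient of the squared loss
        theta = theta - sigma * grad
    return theta
```

For the L-BFGS route, `scipy.optimize.minimize` with `method="L-BFGS-B"` accepts exactly this loss/gradient pair, so you do not have to implement the quasi-Newton updates yourself.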
You could also apply stochastic gradient descent or one of its variants (AdaGrad, AdaDelta), which process one sample (or a small batch) at a time and therefore never need the full $U$ in memory. There is extensive literature on the subject.
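To connect this with the scheme asked for in the question, here is a minimal single-pass SGD sketch (the function name and the $\sigma_0/\sqrt{t}$ step-size schedule are illustrative choices, not the only possibilities):

```python
import numpy as np

def sgd_least_squares(pairs, n, sigma0=0.01):
    """One pass of stochastic gradient descent over (u_i, y_i) pairs.

    `pairs` can be any iterable (e.g. rows streamed from disk), so the full
    U and Y never need to be held in memory at the same time.
    """
    theta = np.zeros(n)
    for t, (u, y) in enumerate(pairs, start=1):
        step = sigma0 / np.sqrt(t)               # decaying step size
        grad_t = -2.0 * u * (y - u @ theta)      # gradient of the t-th summand
        theta -= step * grad_t
    return theta
```

Each update uses only the previous iterate and the current pair $(y_i, u_i)$, i.e. exactly the form $\hat\theta_i = g(\hat\theta_{i-1}, y_i, u_i)$ you asked for; the caveat is that a single pass gives an approximation of $\hat\theta$, not the exact least-squares solution.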