Consider a least squares (LS) problem $$\mathbf{x}_\text{LS} = \arg \min_{\mathbf{x}\in\mathbb{R}^d}\Vert \mathbf{y-Ax} \Vert_2^2,$$ where $\mathbf{y}\in\mathbb{R}^n$, $\mathbf{A}\in\mathbb{R}^{n\times d}$ and $\mathbf{x}\in\mathbb{R}^d$.
We can solve it with either the normal equations or gradient descent.
Suppose that $\mathbf{A}$ has full column rank. Then the normal-equations solution can be written as $$\mathbf{x}_\text{LS} = \left( \mathbf{A}^\top\mathbf{A} \right)^{-1}\mathbf{A}^\top\mathbf{y}.$$ The computational cost of forming $\mathbf{C} = \mathbf{A^\top A}$ is $\mathcal{O}\left( nd^2 \right)$, of $\mathbf{d} = \mathbf{A^\top y}$ is $\mathcal{O}\left( nd \right)$, of the Cholesky factorization $\mathbf{C} = \mathbf{GG^\top}$ is $\mathcal{O}\left( d^3/3 \right)$, and of the two triangular solves $\mathbf{Gz=d}$ and $\mathbf{G^{\top} x_\text{LS} = z}$ is $\mathcal{O} \left( d^2 \right)$. So the total computational complexity of the normal equations is $$\mathcal{O}\left( nd^2 + nd + d^3/3\right).$$ For gradient descent, the iteration can be written as $$\mathbf{x}^k = \mathbf{x}^{k-1} - 2\mu\left(\mathbf{A^\top A}\mathbf{x}^{k-1} - \mathbf{A^\top y}\right),\ k=1,2,\dots,$$ where $\mu>0$ is the step size.
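As a concrete sketch, the normal-equations route above can be implemented with NumPy (using `np.linalg.solve` for the two triangular solves for brevity; a dedicated triangular solver would exploit the structure to get the $\mathcal{O}(d^2)$ cost):

```python
import numpy as np

def ls_normal_equations(A, y):
    """Solve min_x ||y - A x||_2^2 via the normal equations,
    assuming A has full column rank."""
    C = A.T @ A                  # Gram matrix, O(n d^2)
    d = A.T @ y                  # O(n d)
    G = np.linalg.cholesky(C)    # C = G G^T, O(d^3 / 3)
    z = np.linalg.solve(G, d)    # forward solve  G z = d
    x = np.linalg.solve(G.T, z)  # backward solve G^T x = z
    return x

# Sanity check against NumPy's built-in least-squares routine.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
x = ls_normal_equations(A, y)
x_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x, x_ref))
```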
The per-iteration computational complexity is as follows.
Forming $\mathbf{C = A^\top A}$ costs $\mathcal{O}\left( nd^2 \right)$ but needs to be done only once, since the result can be stored and reused across iterations. Likewise, $\mathbf{A^\top y}$ costs $\mathcal{O}\left( nd \right)$ and is computed once. Each iteration then requires $\mathbf{C}\mathbf{x}^{k-1}$, which is $\mathcal{O}\left( d^2\right)$, plus the update $\mathbf{x}^{k-1} - 2\mu\left(\mathbf{C}\mathbf{x}^{k-1} - \mathbf{A^\top y}\right)$, which is $\mathcal{O}\left(d\right)$. So the total computational complexity is $$\mathcal{O}\left( nd^2 + nd + T\left( d^2 + d\right)\right),$$ where $T$ is the number of iterations. Here is my first question: in Andrew Ng's course on Coursera, the computational complexity of gradient descent for least squares is stated as $\mathcal{O}\left( d^2\right)$ per iteration, which appears to be derived this way. Why can he ignore the $nd^2$ term?
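A minimal sketch of this variant, with $\mathbf{C}$ and $\mathbf{A^\top y}$ precomputed once. The step size below is set from the largest eigenvalue of $\mathbf{C}$; that is an assumed (but standard and safe) choice, not something fixed by the derivation above:

```python
import numpy as np

def gd_cached_gram(A, y, T=500):
    """Gradient descent for least squares, precomputing C = A^T A."""
    C = A.T @ A   # O(n d^2), computed once and cached
    b = A.T @ y   # O(n d), computed once
    # Assumed step-size choice: mu = 1 / (2 * lambda_max(C)) guarantees convergence.
    mu = 1.0 / (2.0 * np.linalg.eigvalsh(C)[-1])
    x = np.zeros(A.shape[1])
    for _ in range(T):
        x = x - 2.0 * mu * (C @ x - b)  # O(d^2 + d) per iteration
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 4))
y = rng.standard_normal(200)
x = gd_cached_gram(A, y, T=2000)
x_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x, x_ref, atol=1e-6))
```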
We can also rewrite gradient descent as $$\mathbf{x}^k = \mathbf{x}^{k-1} - 2\mu\mathbf{A^\top}\left(\mathbf{ A}\mathbf{x}^{k-1} - \mathbf{y}\right),\ k=1,2,\dots,$$ where $\mathbf{A}^\top$ is factored out. In this form each iteration needs two matrix–vector products with $\mathbf{A}$ and $\mathbf{A}^\top$ plus two vector updates, so the total computational complexity is $$\mathcal{O}\left( T\left( 2nd+n+d \right) \right),$$ where $T$ is the number of iterations.
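A sketch of this refactored iteration, which never forms $\mathbf{A^\top A}$ and so costs $\mathcal{O}(nd)$ per iteration. As before, the step size is an assumed choice, here taken from the largest singular value of $\mathbf{A}$:

```python
import numpy as np

def gd_matrix_free(A, y, mu, T=500):
    """Gradient descent for least squares without forming A^T A."""
    x = np.zeros(A.shape[1])
    for _ in range(T):
        r = A @ x - y                  # residual, O(nd + n)
        x = x - 2.0 * mu * (A.T @ r)   # gradient step, O(nd + d)
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((300, 6))
y = rng.standard_normal(300)
# Assumed step size: sigma_max(A)^2 = lambda_max(A^T A).
mu = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
x = gd_matrix_free(A, y, mu, T=3000)
x_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x, x_ref, atol=1e-6))
```

Which of the two gradient-descent forms is cheaper per iteration depends on whether $d^2$ or $2nd$ is smaller, i.e., roughly on whether $d \lessgtr 2n$.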
Here is my second question. We now have three computational complexities, all depending on $T$, $n$, and $d$. So my question is:
In the gradient descent method, which form has the better complexity? Is one of the three complexities always the smallest, or do we need to compare them case by case based on $T$, $n$, and $d$? In other words, in what scenarios should we use the normal equations, and in what scenarios should we use gradient descent? For example, if $T$ is much larger than $n$ and $d$, is it better to use the normal equations instead of gradient descent?