I am reading a text that discusses the covariance matrix. I know that, in general, we want the number of observations to be greater than the number of variables. In OLS regression, we cannot solve the system uniquely if this is not true, since we have more unknowns than equations. This is a problem when trying to find the $b$ that minimizes $\|y - Xb\|^2$.
However, a covariance matrix is just a covariance matrix, so I am not sure why the number of observations needs to exceed the number of variables. The author seems to imply that singular covariance matrices are bad, but I do not understand why, or how singular covariance matrices are related to $n < p$.
Think of this trivial example: suppose our data points are two-dimensional, so that an observation $(x,y)\in\mathbb{R}^2$ can be plotted as a point in the plane. The goal of OLS in this example is to find the line in the plane that best fits the data points. If you only have one data point in the plane, how can you possibly expect to find the best line? You can't: with only one data point, there are infinitely many lines that pass through it. However, as soon as there are two data points (with distinct $x$-coordinates), you can find a unique best-fit line through the data with respect to the OLS cost function. Indeed, with two such observations the OLS cost will be zero, and with three or more data points the cost will typically be greater than zero, but the OLS cost function still ensures the best-fit line is unique and well-defined. Generalizing to higher dimensions, this is the reason why you're being told to have more observations than the ambient dimension of the data.
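Here is a quick numerical sketch of that example (the specific points are made up for illustration). With one point, the design matrix for a line $y = ax + b$ is rank-deficient, so the problem is underdetermined; with two points it has full rank and the fit is unique:

```python
import numpy as np

# One observation: the design matrix [x, 1] has rank 1 but there are
# 2 unknowns (slope and intercept), so infinitely many lines fit exactly.
X1 = np.array([[2.0, 1.0]])           # single point with x = 2
rank1 = np.linalg.matrix_rank(X1)
print(rank1)                          # 1 -> underdetermined

# Two observations with distinct x: rank 2, so the best-fit line is unique.
X2 = np.array([[1.0, 1.0],
               [2.0, 1.0]])           # points with x = 1 and x = 2
y2 = np.array([1.0, 3.0])
beta, _, rank2, _ = np.linalg.lstsq(X2, y2, rcond=None)
print(rank2)                          # 2 -> unique solution
print(beta)                           # slope 2, intercept -1 (zero cost)
```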
Another interpretation is this: Suppose you have $m$ observations and the data is $n$-dimensional. If $m<n$, then there may be directions in the $n$-dimensional data-space that haven't yet been "explored" by your samples. Therefore, without information along these directions, how can you expect to accurately model the variance of the data in these directions? In this case, you need more samples in order to fully capture the variance of the data in all directions of the ambient data-space.
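This "unexplored directions" picture can also be checked numerically. In the sketch below (random data, made up for illustration), $m=3$ samples in $n=5$ dimensions span at most a 2-dimensional subspace after centering, so the sample covariance matrix assigns zero variance to every direction orthogonal to the samples and is therefore singular:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                          # fewer observations than dimensions
X = rng.standard_normal((m, n))

# Sample covariance of the centered data; centering removes one degree of
# freedom, so the rank is at most m - 1 = 2 even though S is 5 x 5.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / m
print(np.linalg.matrix_rank(S))      # at most 2 -> S is singular

# Directions not "explored" by the samples carry zero estimated variance:
eigvals = np.linalg.eigvalsh(S)
print(eigvals)                       # at least 3 (numerically) zero eigenvalues
```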
We can also see why you need more observations than the dimension of the data in a rigorous mathematical sense. In particular, suppose $X\in\mathbb{R}^{m\times n}$ is the "input" data and $y\in\mathbb{R}^m$ is the "output" data. The number of observations is $m$, and the dimension of the inputs is $n$. The goal of OLS is to find a linear model, $\hat{y} = x^\top \beta$, such that the prediction $\hat{y}$ closely matches what the true output associated with $x$ would be. Searching for this model can be formulated as the following optimization problem:
\begin{equation*}
\text{minimize}_{\beta\in\mathbb{R}^n} ~ f(\beta) = \|y-X\beta\|_2^2.
\end{equation*}
Note that OLS uses the 2-norm as its notion of "cost". The squared 2-norm is particularly nice since it is differentiable and gives us a closed-form solution. In particular, we have that
\begin{equation*}
f(\beta) = (y-X\beta)^\top(y-X\beta) = y^\top y - 2y^\top X\beta + \beta^\top X^\top X\beta.
\end{equation*}
The gradient is
\begin{equation*}
\nabla f(\beta) = -2X^\top y + 2X^\top X\beta,
\end{equation*}
and so the first-order optimality condition $\nabla f(\beta^*)=0$ shows that the optimal solution satisfies
\begin{equation*}
X^\top X\beta^* = X^\top y.
\end{equation*}
This is called the "normal equation". The normal equation has a unique solution when $X^\top X$ is invertible, and the solution is given by
\begin{equation*}
\beta^* = (X^\top X)^{-1}X^\top y.
\end{equation*}
In order for $X^\top X$ to be invertible, $X$ must have full column rank $n$, which requires $m\ge n$, i.e., at least as many observations as the ambient data-space dimension. Note that $\frac{1}{m}X^\top X$ is precisely the sample covariance matrix when the data has been centered; centering reduces the rank of $X$ to at most $m-1$, so in that case invertibility in fact requires $m>n$, i.e., strictly more observations than dimensions.
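The derivation above can be verified numerically. The sketch below (random data, made up for illustration) checks that the normal-equation solution agrees with a library least-squares solver when $m > n$, and that $X^\top X$ is singular when $m < n$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 10, 3                             # more observations than unknowns
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Normal-equation solution beta* solving (X^T X) beta = X^T y ...
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# ... agrees with the library least-squares solver.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_ne, beta_ls))     # True

# With m < n, X^T X (here 3 x 3) has rank at most m = 2, so it is
# singular and the closed-form inverse does not exist.
Xs = rng.standard_normal((2, 3))
print(np.linalg.matrix_rank(Xs.T @ Xs))  # 2 < 3 -> singular
```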