I am currently studying the textbook Statistical Inference by Casella and Berger. Chapter 11.3.1 Least Squares: A Mathematical Solution says the following:
For any line $y = c + dx$, the residual sum of squares (RSS) is defined to be $$\text{RSS} = \sum_{i = 1}^n (y_i - (c + dx_i))^2 .$$ The RSS measures the vertical distance from each data point to the line $c + dx$ and then sums the squares of these distances. (Two such distances are shown in Figure 11.3.1.) The least squares estimates of $\alpha$ and $\beta$ are defined to be those values $a$ and $b$ such that the line $a + bx$ minimizes RSS. That is, the least squares estimates, $a$ and $b$, satisfy $$\min_{c, d} \sum_{i = 1}^n (y_i - (c + dx_i))^2 = \sum_{i = 1}^n (y_i - (a + bx_i))^2.$$ This function of two variables, $c$ and $d$, can be minimized in the following way. For any fixed value of $d$, the value of $c$ that gives the minimum value can be found by writing $$\sum_{i = 1}^n (y_i - (c + dx_i))^2 = \sum_{i = 1}^n ((y_i - dx_i) - c)^2 .$$ From Theorem 5.2.4, the minimizing value of $c$ is $$c = \dfrac{1}{n} \sum_{i = 1}^n (y_i - dx_i) = \overline{y} - d \overline{x}. \tag{11.3.9}$$ Thus, for a given value of $d$, the minimum value of RSS is $$\sum_{i = 1}^n ((y_i - dx_i) - (\overline{y} - d \overline{x}))^2 = \sum_{i = 1}^n ((y_i - \overline{y}) - d(x_i - \overline{x}))^2 = S_{yy} - 2dS_{xy} + d^2 S_{xx}.$$ The value of $d$ that gives the overall minimum value of RSS is obtained by setting the derivative of this quadratic function of $d$ equal to $0$. The minimizing value is $$d = \dfrac{S_{xy}}{S_{xx}}. \tag{11.3.10}$$ This value is, indeed, a minimum since the coefficient of $d^2$ is positive. Thus, by (11.3.9) and (11.3.10), $a$ and $b$ from (11.3.8) are the values of $c$ and $d$ that minimize the residual sum of squares.
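To convince myself that the closed-form solution (11.3.9)–(11.3.10) really does minimize the RSS, I checked it numerically against a generic least squares routine. The data below are made up for illustration (they are not the book's Table 11.3.1):

```python
import numpy as np

# Hypothetical example data (not from the book's Table 11.3.1).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))

# Least squares estimates from (11.3.10) and (11.3.9).
b = Sxy / Sxx
a = ybar - b * xbar

# Cross-check against a direct degree-1 polynomial fit.
b_ref, a_ref = np.polyfit(x, y, deg=1)  # returns (slope, intercept)
assert np.isclose(a, a_ref) and np.isclose(b, b_ref)
```

The assertion passes, so the closed-form estimates agree with the fitted line from `np.polyfit`.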
The RSS is only one of many reasonable ways of measuring the distance from the line $c + dx$ to the data points. For example, we could instead measure the horizontal distance from each data point to the line. This is equivalent to graphing the $y$ variable on the horizontal axis and the $x$ variable on the vertical axis and using vertical distances as we did above. Using the above results (interchanging the roles of $x$ and $y$), we find the least squares line is $\hat{x} = a^\prime + b^\prime y$, where $$b^\prime = \dfrac{S_{xy}}{S_{yy}} \ \ \ \text{and} \ \ \ a^\prime = \overline{x} - b^\prime \overline{y}.$$
Reexpressing the line so that $y$ is a function of $x$, we obtain $\hat{y} = -(a^\prime / b^\prime) + (1/b^\prime)x$.
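A quick numerical sketch of this reexpression, again with made-up data: one useful sanity check is that the reexpressed line $\hat{y} = -(a^\prime / b^\prime) + (1/b^\prime)x$ must still pass through the point $(\overline{x}, \overline{y})$, since $-(a^\prime/b^\prime) + \overline{x}/b^\prime = (\overline{x} - a^\prime)/b^\prime = \overline{y}$ by the definition of $a^\prime$.

```python
import numpy as np

# Illustrative data (not the book's Table 11.3.1).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 7.0, 8.0, 9.0])

xbar, ybar = x.mean(), y.mean()
Syy = np.sum((y - ybar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))

# Least squares regression of x on y (horizontal distances).
b_prime = Sxy / Syy
a_prime = xbar - b_prime * ybar

# Reexpress x-hat = a' + b'y as a line in the (x, y) plane:
# y = -(a'/b') + (1/b') x.
intercept = -a_prime / b_prime
slope = 1.0 / b_prime

# The reexpressed line passes through (xbar, ybar).
assert np.isclose(intercept + slope * xbar, ybar)
```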
Usually the line obtained by considering horizontal distances is different from the line obtained by considering vertical distances. From the values in Table 11.3.1, the regression of $y$ on $x$ (vertical distances) is $\hat{y} = 1.86 + .68 x$. The regression of $x$ on $y$ (horizontal distances) is $\hat{y} = -2.31 + 2.82x$. In Figure 12.2.2, these two lines are shown (along with a third line discussed in Section 12.2). If these two lines were the same, then the slopes would be the same and $b/(1/b^\prime)$ would equal 1. But, in fact, $b/(1/b^\prime) \le 1$ with equality only in special cases. Note that $$\dfrac{b}{1/b^\prime} = bb^\prime = \dfrac{(S_{xy})^2}{S_{xx} S_{yy}}.$$ Using the version of Hölder's Inequality in (4.7.9) with $p = q = 2$ (the Cauchy–Schwarz Inequality), $a_i = x_i - \overline{x}$, and $b_i = y_i - \overline{y}$, we see that $(S_{xy})^2 \le S_{xx} S_{yy}$ and, hence, the ratio is at most 1.
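For what it's worth, I verified this ratio numerically: $bb^\prime = (S_{xy})^2/(S_{xx}S_{yy})$ is exactly the squared sample correlation coefficient $r^2$, so it can never exceed 1. The data here are simulated, not the book's:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 0.5 * x + rng.normal(size=50)  # noisy linear relationship

Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

b = Sxy / Sxx          # slope of the regression of y on x
b_prime = Sxy / Syy    # slope of the regression of x on y
ratio = b * b_prime    # equals (Sxy)^2 / (Sxx * Syy)

# Cauchy-Schwarz guarantees ratio <= 1; it equals the squared
# sample correlation coefficient r^2.
r = np.corrcoef(x, y)[0, 1]
assert ratio <= 1.0
assert np.isclose(ratio, r ** 2)
```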
For reference, (11.3.8) states that the most common estimates of $\alpha$ and $\beta$ in $E(Y_i \mid x_i) \approx \alpha + \beta x_i$ are given by $b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \overline{y} - b \overline{x}$.
$S_{xx} = \sum_{i = 1}^n (x_i - \overline{x})^2$, $S_{yy} = \sum_{i = 1}^n (y_i - \overline{y})^2$, $S_{xy} = \sum_{i = 1}^n (x_i - \overline{x})(y_i - \overline{y})$.
(4.7.9) is given as follows:
The preceding theorems also apply to numerical sums where there is no explicit reference to any expectation. For example, for numbers $a_i$, $b_i$, $i = 1, \dots, n$, the inequality $$\sum_{i = 1}^n | a_i b_i | \le \left( \sum_{i = 1}^n |a_i|^p \right)^{1/p} \left( \sum_{i = 1}^n |b_i|^q \right)^{1/q}, \quad \dfrac{1}{p} + \dfrac{1}{q} = 1, \tag{4.7.9}$$ is a version of Hölder's Inequality. To establish (4.7.9), we can formally set up an expectation with respect to random variables taking values $a_1, \dots, a_n$ and $b_1, \dots, b_n$. (This is done in Example 4.7.8.) An important special case of (4.7.9) occurs when $b_i \equiv 1$, $p = q = 2$. We then have $$\dfrac{1}{n} \left( \sum_{i = 1}^n |a_i| \right)^2 \le \sum_{i = 1}^n a_i^2.$$
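To make sure I was reading (4.7.9) correctly, I checked it numerically for a few conjugate pairs $(p, q)$, along with the $b_i \equiv 1$, $p = q = 2$ special case, using arbitrary example vectors:

```python
import numpy as np

# Arbitrary example vectors (signs included to exercise the |.|'s).
a = np.array([1.0, -2.0, 3.0, 0.5])
b = np.array([-1.5, 2.0, 0.25, 4.0])

# Check (4.7.9) for several conjugate exponent pairs.
for p in (2.0, 3.0, 1.5):
    q = p / (p - 1.0)  # conjugate exponent: 1/p + 1/q = 1
    lhs = np.sum(np.abs(a * b))
    rhs = (np.sum(np.abs(a) ** p) ** (1 / p)
           * np.sum(np.abs(b) ** q) ** (1 / q))
    assert lhs <= rhs

# Special case b_i = 1, p = q = 2:
# (1/n) * (sum |a_i|)^2 <= sum a_i^2.
n = len(a)
assert (np.sum(np.abs(a)) ** 2) / n <= np.sum(a ** 2)
```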
How do (4.7.9) and the identity $\dfrac{b}{1/b^\prime} = bb^\prime = \dfrac{(S_{xy})^2}{S_{xx} S_{yy}}$ give us, as the authors claim, that $(S_{xy})^2 \le S_{xx} S_{yy}$ and, hence, that the ratio is at most 1?