Question setup/background

I am concerned with finding the confidence region for the best-fit parameters $\hat{\beta}$ in a linear least squares problem. Suppose we have data
$$Y = X\breve{\beta} + \epsilon$$
$Y$ is an $n\times 1$ vector of data points, $\breve{\beta}$ is a $p\times 1$ vector of the true model parameters, $\epsilon$ is an $n\times 1$ vector of noise affecting the data, and $X$ is the $n\times p$ design matrix. I assume $X$ has full column rank (rank $p$, with $p<n$) and that all quantities are real. We suppose
\begin{align} \epsilon \sim \mathcal{N}(0, \sigma^2 I) \end{align}
Initially I'll assume $\sigma$ is known but later I may assume $\sigma$ is unknown.
We can perform a linear fit to the data and calculate the fit residuals:
$$ r(\beta) = Y - X\beta $$
And from the residuals calculate the sum of squares parameter for the fit.
$$ S(\beta) = r(\beta)^Tr(\beta) $$
It can be shown that this cost function is minimized when $\beta$ is set to $\hat{\beta}$ with
$$ \hat{\beta} = (X^TX)^{-1}X^T Y = X^+Y $$
where $X^+$ is the Moore–Penrose inverse of $X$. From this, a little algebra gives three useful results. Define
\begin{align} P_X &= XX^+\\ Q_X &= I-XX^+ \end{align}
It can be shown that $P_X$ is a rank $p$ orthogonal projector onto the image of $X$ and $Q_X$ is a rank $n-p$ projector onto the orthogonal complement of the image of $X$. One can derive with some algebra:
\begin{align} A = \frac{1}{\sigma^2}S(\breve{\beta}) &= \frac{1}{\sigma^2}\epsilon^T \epsilon \sim \chi^2_n\\ B = \frac{1}{\sigma^2}S(\hat{\beta}) &= \frac{1}{\sigma^2}\epsilon^TQ_X\epsilon \sim \chi^2_{n-p}\\ C = \frac{1}{\sigma^2}\left(S(\breve{\beta}) - S(\hat{\beta})\right) &= \frac{1}{\sigma^2}\epsilon^T P_X\epsilon \sim \chi^2_p \end{align}
That these quadratic forms are all $\chi^2$-distributed follows from diagonalizing $P_X$ or $Q_X$ with an orthogonal matrix (an orthogonal transform of a multivariate normal variable is again multivariate normal), and noting that $P_X$ has $p$ unit eigenvalues (the rest zero) while $Q_X$ has $n-p$ unit eigenvalues (the rest zero).
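These distributional facts are easy to verify numerically. Below is a small sketch (my own illustration, with assumed dimensions $n=20$, $p=3$) that builds a random design, forms $P_X$ and $Q_X$ from the pseudoinverse, checks the projector properties, and confirms the $\chi^2$ degrees of freedom via a quick Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3                      # assumed dimensions for illustration
X = rng.standard_normal((n, p))   # random full-rank design

X_pinv = np.linalg.pinv(X)        # Moore-Penrose inverse X^+
P = X @ X_pinv                    # orthogonal projector onto im(X)
Q = np.eye(n) - P                 # projector onto the orthogonal complement

# Projector checks: idempotent and symmetric
assert np.allclose(P @ P, P) and np.allclose(P, P.T)
assert np.allclose(Q @ Q, Q)

# For an orthogonal projector, rank = trace; these are the chi^2 dof
print(round(np.trace(P)), round(np.trace(Q)))  # p and n - p

# Monte Carlo check of the chi^2 means: E[chi^2_k] = k
sigma = 1.0
eps = sigma * rng.standard_normal((5000, n))
A = np.einsum('ij,ij->i', eps, eps) / sigma**2        # ~ chi^2_n
B = np.einsum('ij,jk,ik->i', eps, Q, eps) / sigma**2  # ~ chi^2_{n-p}
C = np.einsum('ij,jk,ik->i', eps, P, eps) / sigma**2  # ~ chi^2_p
print(A.mean(), B.mean(), C.mean())  # roughly n, n-p, p
```

Note $A = B + C$ holds sample by sample, since $P_X + Q_X = I$; this identity matters later.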
My question is now about finding the confidence regions given these statistical distributions. Suppose we want to find a 95% confidence region.
What I have sometimes seen is that since $C\sim\chi^2_p$ we can construct a confidence region as
\begin{align} R_1 = \left\{\beta: \frac{1}{\sigma^2}(S(\beta) - S(\hat{\beta})) \le \chi^2_{p,0.95}\right\} \end{align}
where $\chi^2_{p,0.95}$ denotes the 0.95 quantile of the $\chi^2_p$ distribution.
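As a concrete sketch (my own example, with assumed $n$, $p$, $\sigma$, and $\breve{\beta}$; scipy's `chi2.ppf` supplies the quantile), membership of a trial $\beta$ in $R_1$ can be tested directly, and the nominal 95% coverage can be checked by simulation, since $\breve{\beta} \in R_1$ exactly when $C \le \chi^2_{p,0.95}$:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, p, sigma = 30, 2, 0.5              # assumed values for illustration
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0])     # plays the role of beta-breve

def in_R1(beta, Y, X, sigma, level=0.95):
    """Is beta in R_1 = {beta: (S(beta) - S(beta_hat))/sigma^2 <= chi2_{p,level}}?"""
    beta_hat = np.linalg.pinv(X) @ Y
    S = lambda b: np.sum((Y - X @ b) ** 2)
    return (S(beta) - S(beta_hat)) / sigma**2 <= chi2.ppf(level, df=X.shape[1])

# Coverage check: the true parameter should land in R_1 about 95% of the time
trials = 2000
hits = sum(
    in_R1(beta_true, X @ beta_true + sigma * rng.standard_normal(n), X, sigma)
    for _ in range(trials)
)
print(hits / trials)  # should be close to 0.95
```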
Question itself
My question is could we use the following region as a confidence region as well since $A \sim \chi^2_n$:
\begin{align} R_2 = \left\{\beta: \frac{1}{\sigma^2}S(\beta) \le \chi^2_{n,0.95}\right\} \end{align}
If so then I find it strange because I haven't been able to find this formula in the literature. If not could you please explain what is wrong with using this as a confidence region? See below for details about what exactly I'm confused about.
Second similar question
$R_1$ (and possibly $R_2$, depending on the answer to the question above) only works if $\sigma^2$ is known. If $\sigma^2$ is unknown, the confidence regions above cannot be computed directly from the data. This can be resolved by taking a ratio of two of the quantities above:
\begin{align} \frac{\frac{1}{\sigma^2}(S(\breve{\beta}) - S(\hat{\beta}))\frac{1}{p}}{\frac{1}{\sigma^2} S(\hat{\beta})\frac{1}{n-p}} = \frac{S(\breve{\beta}) - S(\hat{\beta})}{S(\hat{\beta})} \frac{n-p}{p} \sim F_{p, n-p} \end{align}
This works because the expression is the ratio of two independent $\chi^2$ statistics, each divided by its degrees of freedom. Conveniently, $\sigma^2$ has dropped out, so the statistic can be computed from the data alone. A confidence region can then be calculated as
\begin{align} R_3 = \left\{\beta: \frac{S(\beta)-S(\hat{\beta})}{S(\hat{\beta})}\frac{n-p}{p} \le F_{p,n-p}^{\alpha} \right\} \end{align}
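A simulation sketch of $R_3$ (my own illustration, with assumed $n$, $p$, and $\sigma$; note that $\sigma$ is used only to generate the data and never appears in the membership test, and scipy's `f.ppf` supplies the critical value):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(2)
n, p, sigma = 25, 3, 2.0          # assumed values; sigma only generates data
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)

def in_R3(beta, Y, X, level=0.95):
    """Membership test for R_3; sigma^2 has cancelled, so only the data enter."""
    n, p = X.shape
    beta_hat = np.linalg.pinv(X) @ Y
    S = lambda b: np.sum((Y - X @ b) ** 2)
    stat = (S(beta) - S(beta_hat)) / S(beta_hat) * (n - p) / p
    return stat <= f.ppf(level, dfn=p, dfd=n - p)

trials = 2000
hits = sum(
    in_R3(beta_true, X @ beta_true + sigma * rng.standard_normal(n), X)
    for _ in range(trials)
)
print(hits / trials)  # should be close to 0.95 even though sigma was never used
```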
The question is whether the following confidence region would also be appropriate:
\begin{align} R_4 = \left\{\beta: \frac{S(\beta)-S(\hat{\beta})}{S(\beta)}\frac{n}{p} \le F_{p,n}^{\alpha} \right\} \end{align}
I wrote this down because
\begin{align} \frac{S(\breve{\beta}) - S(\hat{\beta})}{S(\breve{\beta})} \frac{n}{p} \sim F_{p, n} \end{align}
by the same logic as used for $R_3$.
Summary of my confusion
I don't think my confusion is really about linear least squares so much as about how to generate confidence regions from any test statistic. The recipe seems to be: find some statistic depending on $\breve{\beta}$ with the property that
\begin{align} F(\breve{\beta}, \theta) \sim D \end{align}
for some known distribution $D$ that does not depend on $\breve{\beta}$ or on any of the nuisance parameters $\theta$; the confidence region is then
\begin{align} R = \left\{\beta: F(\beta, \theta) \le D_{\alpha} \right\} \end{align}
This is the formula I have followed to generate confidence regions $R_2$ and $R_4$, but I don't seem to see those regions anywhere in the literature so I feel I must be misunderstanding something about how to form confidence regions.
Also note
I know that in the linear case some of these expressions can be rewritten in terms of only $\beta$, $\hat{\beta}$, and $X$; for example, $S(\beta) - S(\hat{\beta}) = (\beta- \hat{\beta})^T X^T X(\beta-\hat{\beta})$. I'm holding off on performing this transformation because a large part of my motivation is understanding how to calculate confidence regions for non-linear least squares as well, where it is sometimes better to work with confidence regions defined by level surfaces of $S(\beta)$ (as I've done in this post) than with the asymptotic linear confidence regions.
There are issues with both $R_2$ and $R_4$, which I will explain here.
First, $R_2$. Because $\frac{1}{\sigma^2}S(\breve{\beta}) \sim \chi^2_n$, it is correct that
$$ R_2 = \left\{\beta: \frac{1}{\sigma^2}S(\beta)\le \chi^2_{n,\gamma} \right\} $$
will contain $\breve{\beta}$ on $100\gamma\%$ of realizations. The problem, however, is that this region does not necessarily contain $\hat{\beta}$. This is strange because $\hat{\beta}$ is supposed to be our estimator for $\breve{\beta}$, so it is odd to pick a confidence region that may not even include that estimate.
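Note that because $\hat{\beta}$ minimizes $S$, whenever $\hat{\beta}\notin R_2$ the region $R_2$ is in fact empty. Since $\hat{\beta}\in R_2$ exactly when $B \le \chi^2_{n,\gamma}$ and $B\sim\chi^2_{n-p}$, the probability of an empty region is $P(\chi^2_{n-p} > \chi^2_{n,\gamma})$, which is small but strictly positive. A quick check with scipy (an assumed small example with $n=5$, $p=3$):

```python
from scipy.stats import chi2

n, p, gamma = 5, 3, 0.95          # assumed small example for illustration
cutoff = chi2.ppf(gamma, df=n)    # chi^2_{n,gamma}

# beta_hat is in R_2 iff S(beta_hat)/sigma^2 = B <= cutoff, with B ~ chi^2_{n-p},
# so the probability that R_2 misses beta_hat (and hence is empty) is:
p_empty = chi2.sf(cutoff, df=n - p)
print(p_empty)   # small but strictly positive
```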
$R_4$ is problematic for a separate reason. The statistic for $R_3$ is the ratio of $C$ and $B$ above. Each is $\chi^2$-distributed, but, importantly, one is a quadratic form in $P_X$ and the other in $Q_X$. From this one can show that the two statistics are uncorrelated, and in fact independent. Independence of the two $\chi^2$ variables is required for their ratio to follow an $F$ distribution.
For the region $R_4$, which is based on the ratio of $C$ and $A$, this requirement fails: since $A = B + C$, the random variables $C$ and $A$ are not independent, so their ratio is not necessarily $F$-distributed.
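A quick simulation (my own sketch, with assumed $n=10$, $p=4$) makes the failure visible. Using $A = B + C$ with $B\sim\chi^2_{n-p}$ and $C\sim\chi^2_p$ independent, the $R_4$ statistic $(C/p)/(A/n)$ can be simulated directly: its mean is $1$ (since $E[C/(B+C)] = p/n$) and it is bounded above by $n/p$, whereas an $F_{p,n}$ variable has mean $n/(n-2)$ and unbounded support:

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(3)
n, p = 10, 4                       # assumed dimensions for illustration
trials = 20000

# Since A = B + C with B ~ chi^2_{n-p} and C ~ chi^2_p independent,
# simulate the R_4 statistic (C/p)/(A/n) directly:
C = rng.chisquare(p, trials)
B = rng.chisquare(n - p, trials)
stat = (C / p) / ((B + C) / n)

print(stat.mean())                 # close to 1, since E[C/(B+C)] = p/n
print(f.mean(p, n))                # n/(n-2): the mean an F_{p,n} variable has
print(stat.max(), "vs F 95% point", f.ppf(0.95, p, n))
# The statistic never exceeds n/p = 2.5, while F_{p,n} has unbounded support,
# so the two distributions cannot agree.
```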