As you may know, the Bayesian Information Criterion (BIC) can be used for model selection in linear regression: the model with the minimum BIC is selected as the best model. The BIC formula is given by (https://en.wikipedia.org/wiki/Bayesian_information_criterion):
$$BIC(M)=k\log(n)-2\log(\bar{L})$$
or for linear regression:
$$BIC(M)=k\log(n)+n\log(RSS/n)$$
where $\bar{L}$ is the maximized value of the likelihood function of the model, i.e. $\bar{L}=p(x|M,\hat{\theta})$ with $\hat{\theta}$ the parameter values that maximize the likelihood, $k$ is the number of estimated parameters (the regression coefficients) and $n$ is the number of data points.
I am looking for the derivation of it. I googled but could not find a document explaining the derivation of BIC for linear regression. I tried to derive the formula myself but I get confused about the model: what is my model, what am I trying to maximize, what is $\theta$?
Can you provide any information regarding the derivation of BIC for linear regression? Thanks.
In case somebody is looking for the derivation of the BIC formula for linear regression, here it is.
Assuming that $Y$ depends linearly on the predictors $X_1,\dots,X_p$, the relationship can be formulated as:
$$Y=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p+\epsilon=f(X)+\epsilon$$
where $\epsilon$ is a normal random variable with zero mean and variance $\sigma^2$. We are trying to estimate the $\beta$ coefficients, and there may be multiple candidate regression models. In that case BIC can be used for model selection.
From the regression equation, $\epsilon=Y-f(X)$; since the errors are assumed to be i.i.d. Gaussian with zero mean and variance $\sigma^2$, the likelihood of the observed data can be written as:
$$ L=\prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(Y_i-f(X_i))^2}{2\sigma^2}\right) $$
Carrying out the product and dropping the constant factor $(2\pi)^{-n/2}$, which does not depend on the parameters, we obtain:
$$ L \propto \frac{1}{\sigma^n} \exp\left(-\frac{\sum_{i=1}^{n} (Y_i-f(X_i))^2}{2\sigma^2}\right)=\frac{1}{\sigma^n} \exp\left(\frac{-RSS}{2\sigma^2}\right) $$
Taking the derivative of $\log L$ with respect to $\sigma$ and setting it to zero gives $\sigma^2=\frac{RSS}{n}$. Substituting this value into $L$ to obtain its maximum value, i.e. $\bar{L}$, we obtain
$$ \bar{L}=L\big|_{\sigma^2=RSS/n}=\left(\frac{RSS}{n}\right)^{-n/2}\exp(-n/2) $$
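The claim that $\sigma^2=RSS/n$ maximizes the likelihood can be checked numerically. A minimal sketch with numpy, using made-up residual values as stand-ins for $Y_i-f(X_i)$: we scan a grid of $\sigma^2$ values and confirm the log-likelihood peaks at $RSS/n$.

```python
import numpy as np

# Toy residuals standing in for Y_i - f(X_i); the values are made up.
rng = np.random.default_rng(1)
resid = rng.normal(scale=2.0, size=500)
n = resid.size
rss = resid @ resid

def loglik(sigma2):
    # log of L = (2*pi*sigma2)^(-n/2) * exp(-RSS / (2*sigma2))
    return -0.5 * n * np.log(2 * np.pi * sigma2) - rss / (2 * sigma2)

# Scan sigma^2 over a grid around RSS/n and locate the maximizer.
grid = np.linspace(0.5 * rss / n, 2.0 * rss / n, 10001)
best = grid[np.argmax(loglik(grid))]
print(best, rss / n)  # the grid maximizer should sit at RSS/n
```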
and the log of $\bar{L}$ is
$$ \log(\bar{L})=-\frac{n}{2}\log(RSS/n)-\frac{n}{2} $$
and $-2\log(\bar{L})$ is
$$ -2\log(\bar{L})=n\log(RSS/n)+n $$
which is the second part of the BIC formula for regression. The leftover constant $n$ (and likewise the dropped $(2\pi)^{-n/2}$ factor) is ignored because it is the same for every candidate model and therefore does not affect model comparison.
The first part of BIC for linear regression directly comes from the BIC definition.
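The whole derivation can be sanity-checked numerically: fitting an OLS model to toy data, the general definition $k\log(n)-2\log(\bar{L})$ and the regression shortcut $k\log(n)+n\log(RSS/n)$ should differ only by the model-independent constant $n(1+\log 2\pi)$. A sketch with numpy, where the data-generating coefficients are made up for illustration:

```python
import numpy as np

# Hypothetical toy data; coefficients and noise scale are made up.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.5, size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
rss = resid @ resid

# Maximized Gaussian log-likelihood, plugging in the MLE sigma^2 = RSS/n.
sigma2 = rss / n
loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

k = A.shape[1]  # number of estimated coefficients

# BIC from the general definition vs. the regression shortcut.
bic_general = k * np.log(n) - 2 * loglik
bic_regression = k * np.log(n) + n * np.log(rss / n)
diff = bic_general - bic_regression
print(diff, n * (1 + np.log(2 * np.pi)))  # the two should match
```

The difference between the two BIC values is exactly the constant $n(1+\log 2\pi)$, confirming that dropping it changes nothing when ranking models fitted to the same $n$ data points.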