Basic background
Hi, I'm relatively new to the Statistics and Mathematics Stack Exchange sites, so please bear with me.
I'm trying to learn about the probit and logit models, where the observable $y$ can only take the binary values $y = 0$ (failure) or $y = 1$ (success) for a given value of an explanatory variable $x$. I'm primarily interested in features associated with fitting a logit or probit curve to a set of data of the form $\{x_i,y_i\}.$
The starting point seems relatively straightforward in that the probability of success given some particular value of $x$ can be expressed as $$P(y=1|x) = F(\alpha + \beta x)$$ where $F$ is a monotone (nonlinear) transformation used to map $\alpha + \beta x$ to the probability interval $[0,1]$. For the logit and probit models we choose the functions $$F(z)\equiv \Lambda(z) = \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}}$$ and $$F(z)\equiv \Phi(z) = \int_{-\infty}^z \phi(u)\, du$$ respectively. Here $\phi$ is the standard normal pdf and $\Phi$ the standard normal CDF. A plot reveals that the two curves look very similar, so my guess is that the choice between a probit and a logit model is relatively unimportant for this one-dimensional example. Yet, I have read and derived for myself that the probit model emerges naturally when considering latent variables with normally distributed error terms.
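To make "the curves look very similar" concrete, here is a small sketch in plain Python (standard library only). The rescaling factor 1.702 is the one usually quoted for matching the logistic and probit curves; the exact grid and step are my own choices:

```python
import math

def logistic_cdf(z):
    """Lambda(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Phi(z), the standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The logistic curve with its argument rescaled by about 1.702 tracks the
# probit curve closely over the whole real line.
max_diff = max(abs(logistic_cdf(1.702 * z / 10) - normal_cdf(z / 10))
               for z in range(-60, 61))
print(max_diff)  # roughly 0.01
```

So after absorbing a scale factor into $\beta$, the two models give almost identical fitted probabilities.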
- My first question is: is the choice between logit and probit unimportant, or are there good reasons to prefer one over the other when modelling certain situations?
Curve fitting
As far as I understand, the parameters $\alpha$ and $\beta$ are usually estimated by maximum likelihood (which I know very little about). Concretely, the log-likelihood can be expressed as $$\ell = \sum_{i=1}^N y_i \ln \left[F(\alpha + \beta x_i)\right] + (1-y_i) \ln\left[1-F(\alpha + \beta x_i)\right]$$ and $\alpha$ and $\beta$ are estimated by solving the equations $$\frac{\partial \ell}{\partial \alpha} = 0, \quad \frac{\partial \ell}{\partial \beta} = 0$$ simultaneously by some numerical method, as implemented as standard in programs such as R, Mathematica, MATLAB, etc.
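For concreteness, here is a sketch of what I have in mind for the logit case, in Python with NumPy/SciPy: write the negative of the log-likelihood above and hand it to a general-purpose optimizer (the synthetic data, seed, and starting values are my own choices for illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data from a logit model with true alpha = -1, beta = 2.
x = rng.normal(size=500)
p_true = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x)))
y = rng.binomial(1, p_true)

def neg_log_likelihood(params, x, y):
    """Negative of l = sum_i y_i ln F(z_i) + (1 - y_i) ln(1 - F(z_i)),
    with F the logistic CDF and z_i = alpha + beta * x_i."""
    alpha, beta = params
    z = alpha + beta * x
    # ln F(z) = -log(1 + e^{-z}) and ln(1 - F(z)) = -log(1 + e^{z});
    # np.logaddexp keeps both numerically stable.
    return np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(x, y), method="BFGS")
alpha_hat, beta_hat = res.x
print(alpha_hat, beta_hat)  # close to the true values -1 and 2
```

Minimizing the negative log-likelihood is equivalent to solving the two score equations above, since the optimum is where both partial derivatives vanish.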
So now I imagine that I can feed data of the form $\{x_i,y_i\}$ into some numerical program and obtain estimates $\hat{\alpha}$ and $\hat{\beta}$. Several presumably standard questions then come to mind:
- What are the standard errors of $\hat{\alpha}$ and $\hat{\beta}$, and how are they defined/calculated?
- What are the 95% confidence intervals of $\hat{\alpha}$ and $\hat{\beta}$, and how are they defined/calculated?
- I can use my fitted model $\hat{p}=F(\hat{\alpha}+\hat{\beta}x)$ to calculate probabilities for $x$ values not included in my dataset, and so obtain an $(x,\hat{p})$ graph. But how do I calculate the 95% confidence interval of the predicted probabilities that my model outputs? I imagine graphically it would look something like this, where the upper and lower curves are the confidence limits and the middle curve is the predicted probability.
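From what I have pieced together so far about the first two questions, the standard errors come from the inverse of the observed information matrix (the negative Hessian of the log-likelihood at the maximum), and the 95% intervals are Wald intervals, estimate $\pm\, 1.96 \times$ standard error. A self-contained sketch for the logit case (synthetic data and the Newton-Raphson fit are my own illustration; for the logit model the score and information have simple closed forms):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from a logit model with true alpha = -1, beta = 2.
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])        # design matrix with rows (1, x_i)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([-1.0, 2.0])))))

# Newton-Raphson for the logit model: the score is X^T (y - p) and the
# observed information matrix is X^T W X with W = diag(p_i (1 - p_i)).
theta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    W = p * (1.0 - p)
    info = X.T @ (X * W[:, None])
    theta = theta + np.linalg.solve(info, X.T @ (y - p))

# Estimated covariance matrix of (alpha_hat, beta_hat): inverse information.
# Standard errors are the square roots of its diagonal.
p = 1.0 / (1.0 + np.exp(-(X @ theta)))
info = X.T @ (X * (p * (1.0 - p))[:, None])
cov = np.linalg.inv(info)
se = np.sqrt(np.diag(cov))
ci = np.column_stack([theta - 1.96 * se, theta + 1.96 * se])   # Wald 95% CIs
print(theta, se, ci)
```

I believe this is what the `summary` output of, e.g., R's `glm` reports, but I would appreciate confirmation.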
The last question seems to be handled to some extent in this link. However, I emphasize that the details are rather unclear to me. The lower (upper) confidence limit seems to be expressed in the form $$\text{lower (upper) limit} = \hat{p} \mp 1.96\sqrt{\chi^T C\chi},$$ similar to the standard expression for a 95% confidence interval under normality. Here $\chi$ is a vector presumably related to the data points and $C$ is some matrix called the covariance matrix.
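If I understand the linked formula, $\chi = (1, x)^T$ and $C$ is the estimated covariance matrix of $(\hat{\alpha}, \hat{\beta})$, so that $\chi^T C \chi$ is the variance of the fitted linear predictor $\hat{z} = \hat{\alpha} + \hat{\beta} x$. Here is a sketch of how I would compute such a band; the numbers for the estimates and covariance matrix are made up for illustration, and I compute the interval on the $z$ scale and then map it through $F$, which (unlike adding $\pm 1.96$ standard errors directly to $\hat{p}$) guarantees the band stays inside $[0,1]$:

```python
import numpy as np

# Hypothetical fitted values -- in practice these come from your own fit.
theta = np.array([-1.0, 2.0])                    # (alpha_hat, beta_hat)
C = np.array([[0.02, -0.005],
              [-0.005, 0.03]])                   # estimated covariance matrix

x_grid = np.linspace(-3.0, 3.0, 121)
Xg = np.column_stack([np.ones_like(x_grid), x_grid])   # row i is chi = (1, x_i)

z_hat = Xg @ theta                                     # fitted linear predictor
se_z = np.sqrt(np.einsum("ij,jk,ik->i", Xg, C, Xg))    # sqrt(chi^T C chi)

def F(z):
    """Logistic CDF."""
    return 1.0 / (1.0 + np.exp(-z))

p_hat = F(z_hat)
lower = F(z_hat - 1.96 * se_z)    # band computed on the z scale,
upper = F(z_hat + 1.96 * se_z)    # then mapped through F into [0, 1]
```

Whether this link-scale construction is what the linked answer intends, or whether the $\pm 1.96\sqrt{\chi^T C\chi}$ is really meant to be applied on the probability scale directly, is exactly the part that is unclear to me.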