Probit model question (regression)


I'm reading a thesis and I need your help to understand the equation below.

$$\Pr(\text{failure}=1 \mid X_1,X_2,X_3,X_4)=\int_{-\infty}^z \varphi(k) \, dk\tag{1}$$

$\varphi(k)$ is the standard normal density and $z=\beta_1 X_1+\beta_2 X_2+\beta_3 X_3+\beta_4 X_4$
We estimate regression (1) through probit.

Does that mean we first estimate the probability of failure (i.e. $z$ here) with a probit fit, and then run an ordinary regression, using $z$ from the first stage together with the observed $X_1,X_2,X_3,X_4$, to estimate $\beta_1,\beta_2,\beta_3,\beta_4$? In other words, do we have to regress twice in this model?

Best answer:

In a probit model the data would typically consist of a list of cases, and for each case one observes the values of $X_1,X_2,X_3,X_4$ (the number of such predictors may vary) and whether there was a success or a failure. Let $n$ be the number of cases. One observes $(I_i,x_{1,i},\ldots,x_{4,i})$ for $i=1,\ldots,n$, where $I_i$ indicates "success" or "failure"; often this is coded as $1$ for success and $0$ for failure. But below I'll do it the other way, for the sake of consistency with your notation, i.e. \begin{align} 0 & = \text{success}, \\ 1 & = \text{failure}. \end{align}

One estimates $\beta_1,\beta_2,\beta_3,\beta_4$ by maximum likelihood. That means the estimates $\widehat\beta_1,\widehat\beta_2,\widehat\beta_3,\widehat\beta_4$ would be the values that maximize the probability of the data that were actually observed. To be more explicit: The likelihood function is $$ L(\beta_1,\beta_2,\beta_3,\beta_4) = \left(\prod_{i\,:\,I_i=1} \int_{-\infty}^{\beta_1 x_{1,i}+\cdots+\beta_4 x_{4,i}} \varphi(k)\,dk \right)\left(\prod_{i\,:\,I_i=0} \left(1 - \int_{-\infty}^{\beta_1 x_{1,i}+\cdots+\beta_4 x_{4,i}} \varphi(k)\,dk\right) \right). $$ Each integral is from $-\infty$ to $\beta_1 x_{1,i}+\cdots+\beta_4 x_{4,i}$.

The estimates $\widehat\beta_1,\widehat\beta_2,\widehat\beta_3,\widehat\beta_4$ are the values that maximize this. These would be found by numerical methods; I don't think there's a closed form.
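To make the "single maximization, no second regression" point concrete, here is a minimal sketch of that numerical maximization in Python. The data are simulated (the sample size, seed, and `true_beta` values are my own illustrative choices, not from the thesis), and the log-likelihood is just the logarithm of the $L$ written above, with $\Phi$ computed via `scipy.stats.norm`:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))                 # x_{1,i},...,x_{4,i} for each case i
true_beta = np.array([0.5, -1.0, 0.25, 0.0])  # hypothetical values, for simulation only

# Failure indicator I_i: equals 1 with probability Phi(z_i), z_i = beta . x_i
I = (rng.uniform(size=n) < norm.cdf(X @ true_beta)).astype(int)

def neg_log_likelihood(beta):
    z = X @ beta
    # log L = sum_{i: I_i=1} log Phi(z_i) + sum_{i: I_i=0} log(1 - Phi(z_i));
    # note 1 - Phi(z) = Phi(-z), and logcdf is numerically stabler than log(cdf).
    return -np.sum(I * norm.logcdf(z) + (1 - I) * norm.logcdf(-z))

# One numerical maximization gives all four estimates at once.
res = minimize(neg_log_likelihood, x0=np.zeros(4), method="BFGS")
beta_hat = res.x
```

With a few hundred cases, `beta_hat` lands close to `true_beta`; there is only this one optimization step, not a probit stage followed by a regression stage.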

I think typically there would be an additional parameter: one would have $\alpha+\beta_1 x_{1,i}+\cdots+\beta_4 x_{4,i}$ as the upper bound of integration, and the likelihood function would be $L(\alpha,\beta_1,\beta_2,\beta_3,\beta_4)$.

So: No, you don't regress twice.