Deriving the BISG Bayesian probability formula

The paper

Consumer Financial Protection Bureau (2014). Using publicly available information to proxy for unidentified race and ethnicity: A methodology and assessment. Washington, DC: CFPB, Summer. [1]

provides a methodology to impute race from surname and geography, but I do not understand how its formula is derived. Technical Appendix A states:

Apply Bayes’ Theorem to calculate the likelihood that an individual with surname $s$ living in geographic area $g$ belongs to race or ethnicity $r$. This is described by $$ \Pr(r|g,s) = \frac{p(r|s)\, q(g|r)}{\sum_{r' \in R} p(r'|s)\, q(g|r')} $$

but I am struggling to derive this with my limited knowledge of Bayes' theorem.

Can someone please provide a step-by-step derivation of this formula?

[1] https://files.consumerfinance.gov/f/201409_cfpb_report_proxy-methodology.pdf

Thank you!

Best answer

Second Document: We can calculate the probability of a race given surname and geography using Bayes' theorem as follows, where the denominator is expanded over all races $R_i$ by the law of total probability:
\begin{align*}
\mathbb P[R|G,S] &= \frac{\mathbb P[R,G,S]}{\mathbb P[G,S]}\\
&= \frac{\mathbb P[R,G,S]}{\sum\limits_{R_i} \mathbb P[R_i,G,S]}
\end{align*}

First Document (linked in question): Assume (1) $\mathbb P [G|R,S] = \mathbb P [G|R]$, that is, geography is conditionally independent of surname given race.

Conditioning on $S$ throughout, and using the product rule $\mathbb P[R,G|S] = \mathbb P[G|R,S]\times \mathbb P[R|S]$ in the last step:
\begin{align*}
\mathbb P[R|G,S] &= \frac{\mathbb P[R,G|S]}{\mathbb P[G|S]}\\
&= \frac{\mathbb P[R,G|S]}{\sum\limits_{R_i} \mathbb P[R_i,G|S]}\\
&= \frac{\mathbb P[G|R,S]\times \mathbb P[R|S]}{\sum\limits_{R_i} \mathbb P[G|R_i,S]\times \mathbb P[R_i|S]}
\end{align*}
Now we use the assumption that $\mathbb P [G|R,S] = \mathbb P [G|R]$ for every race, whence:
\begin{align*}
\mathbb P[R|G,S] &= \frac{\mathbb P[G|R]\times \mathbb P[R|S]}{\sum\limits_{R_i} \mathbb P[G|R_i]\times \mathbb P[R_i|S]}
\end{align*}

This is the equation given in the paper, with $p(r|s) = \mathbb P[R|S]$ and $q(g|r) = \mathbb P[G|R]$.
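If it helps to see the formula as arithmetic, here is a minimal Python sketch of the final expression. The function name `bisg_posterior`, the race labels "A", "B", "C", and all the numbers are made up for illustration; they are not taken from the paper or from any census table.

```python
def bisg_posterior(p_race_given_surname, q_geo_given_race):
    """Return P[R | G, S] for every race R.

    p_race_given_surname: dict mapping race -> P[R | S] for one fixed surname.
    q_geo_given_race:     dict mapping race -> P[G | R] for one fixed area.
    Relies on assumption (1): P[G | R, S] = P[G | R].
    """
    # Numerator of the formula for each race: P[G | R] * P[R | S]
    unnormalised = {
        race: q_geo_given_race[race] * p_race_given_surname[race]
        for race in p_race_given_surname
    }
    # Denominator: the same product summed over all races R_i
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("all products are zero, so the posterior is undefined")
    return {race: value / total for race, value in unnormalised.items()}


# Hypothetical inputs (illustrative only, not real data)
p = {"A": 0.6, "B": 0.3, "C": 0.1}        # P[R | S]
q = {"A": 0.001, "B": 0.004, "C": 0.002}  # P[G | R]

posterior = bisg_posterior(p, q)
print({race: round(prob, 3) for race, prob in posterior.items()})
# roughly {'A': 0.3, 'B': 0.6, 'C': 0.1}
```

With these made-up inputs the surname alone favours "A", but "B" is four times as likely as "A" to live in the area, so the posterior shifts towards "B". The division by the sum is what makes the result a probability distribution over races.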


PS: It's hard to reverse-engineer the assumptions under which a given statement holds.