I have a question about probability of default (PD) as it relates to a set of loans versus an individual loan. I am unable to reconcile the following two viewpoints.
Let's say I have 100 loans and observe 20 defaults, so I define my PD to be 20%. Based on this, I want to know the probability of default for a particular loan.
I came up with the following two setups:
Setup 1
I simply run some simulations and find the number of loans that default on average. Here's some R code:
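N <- 100          # number of loans
nDef <- 20        # observed defaults
pd <- nDef/N      # estimated probability of default
nRandom <- 10000  # number of simulated portfolios
out <- NULL
for(i in 1:nRandom){
  # one simulated portfolio: 1 = default, 0 = no default
  r <- rbinom(N, 1, pd)
  # fix the levels so both counts are always present
  out <- rbind(out, table(factor(r, levels = 0:1)))
}
apply(out, 2, median)  # median count of non-defaults (0) and defaults (1)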
The output is 0: 80, 1: 20 which supports the idea that on average I expect a particular loan to default with a probability of 20%.
Setup 2
I thought of doing this another way (R code below):
n <- nDef
(pd^n * (1-pd)^(N-n)) * choose(N, n)
But this gives me ~10%. What am I doing wrong here?
If the probability that a given loan turns out to be bad is a $\text{constant}$ 20%, then you can calculate the probability of getting $n$ bad loans when you select $N$ loans. This can be done using the binomial distribution.
$$P(X=n)=\binom{N}{n}\cdot 0.2^n\cdot (1-0.2)^{N-n}$$
Thus the probability of observing exactly 20 bad loans out of 100 is
$$P(X=20)=\binom{100}{20}\cdot 0.2^{20}\cdot (0.8)^{80}\approx 9.93\%$$
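But you can also calculate the probability of observing 57 bad loans out of 100, for instance. So the roughly 10% is the probability of one particular outcome (exactly 20 defaults), not the probability that a single loan defaults, which remains 20%.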
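As a quick numerical check, here is a minimal R sketch using the built-in `dbinom` function. The variable names `p20`, `p20_manual`, `p57`, `probs`, `total` and `expected_defaults` are only illustrative, and the constant 20% default probability is the assumption discussed here.

```r
N  <- 100   # number of loans
pd <- 0.2   # assumed constant per-loan default probability

# Probability of observing exactly 20 defaults out of 100
p20 <- dbinom(20, size = N, prob = pd)
print(p20)  # about 0.0993

# Same value from the explicit binomial formula
p20_manual <- choose(N, 20) * pd^20 * (1 - pd)^(N - 20)

# Probability of exactly 57 defaults: possible, but extremely unlikely
p57 <- dbinom(57, size = N, prob = pd)

# Probabilities over all possible counts 0..N
probs <- dbinom(0:N, size = N, prob = pd)
total <- sum(probs)                      # sums to 1
expected_defaults <- sum((0:N) * probs)  # equals N * pd = 20
```

Here `expected_defaults / N` recovers the per-loan value of 0.2, whereas `p20` is only the probability of one specific count.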
This constant probability is an assumption that must hold for the binomial distribution above to apply. It is reasonable when the loans are drawn from a large population in which the ratio of bad loans to all loans is $20\%$. Then the probability of drawing a bad loan is approximately $pd=0.2$ at every draw. If the population is not large, drawing without replacement changes this probability noticeably from draw to draw, and you have to use the $\text{hypergeometric distribution}$ instead.
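To illustrate the effect of population size, the following R sketch compares the binomial probability with the hypergeometric one from the built-in `dhyper`. The sample size of 10 loans, the target of 2 bad loans, and the two population sizes (100 and 100000 loans, each with 20% bad) are assumptions chosen only for this example.

```r
k  <- 10    # loans drawn (assumed sample size)
x  <- 2     # number of bad loans in the sample we ask about
pd <- 0.2   # share of bad loans in the population

# Binomial: treats every draw as having the same probability pd
p_binom <- dbinom(x, size = k, prob = pd)

# Hypergeometric, small population: 20 bad and 80 good loans,
# drawing k loans without replacement
p_small <- dhyper(x, m = 20, n = 80, k = k)

# Hypergeometric, large population: 20000 bad and 80000 good loans
p_large <- dhyper(x, m = 20000, n = 80000, k = k)

print(c(binomial = p_binom, small_pop = p_small, large_pop = p_large))
```

With the large population the hypergeometric value is very close to the binomial one, while with the small population the two differ visibly.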