How do I derive the Beta function, using only its definition as the normalizing constant of the Beta distribution and common-sense random experiments?
I'm pretty sure this is possible, but can't see how.
I can see that
$$\newcommand{\Beta}{\mathrm{Beta}}\sum_{a=0}^n {n \choose a} \Beta(a+1, n-a+1) = 1$$
because we can imagine flipping a coin $n$ times. The $2^n$ distinct sequences of flips partition the probability space. The Beta distribution with parameters $a+1$ and $n-a+1$ is the posterior over the coin's bias probability $p$ after observing $a$ heads and $n-a$ tails, starting from a uniform prior, so $\Beta(a+1, n-a+1) = \int_0^1 p^a (1-p)^{n-a}\,dp$ is its normalizing constant. Since there are ${n \choose a}$ sequences with exactly $a$ heads, that explains the scaling factor, and the whole sum is unity because the sequences partition a probability space of total measure 1.
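For what it's worth, this partition-of-unity identity can be confirmed exactly in a quick computation. The sketch below (helper name `beta_integral` is my own) evaluates $\Beta(a+1, b+1) = \int_0^1 t^a (1-t)^b \, dt$ by expanding $(1-t)^b$ with the binomial theorem and integrating term by term in exact rational arithmetic, so the closed form we are after is never assumed:

```python
from fractions import Fraction
from math import comb

def beta_integral(a, b):
    """Exact value of ∫_0^1 t^a (1-t)^b dt, via the binomial
    expansion (1-t)^b = Σ_k C(b,k) (-1)^k t^k integrated termwise."""
    return sum(Fraction(comb(b, k) * (-1) ** k, a + k + 1) for k in range(b + 1))

# Σ_a C(n,a) · Beta(a+1, n-a+1) should equal 1 for every n.
for n in range(8):
    total = sum(comb(n, a) * beta_integral(a, n - a) for a in range(n + 1))
    assert total == 1
```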
What I can't figure out is why:
$${n \choose a} \Beta(a+1, n-a+1) = \frac{1}{n+1} \qquad \forall n \ge 0,\quad a \in \{0, \dots, n\}.$$
If we knew that, we could easily see that
$$\Beta(a + 1,n - a + 1) = \frac{1}{(n+1){n \choose a}} = \frac{a!(n-a)!}{(n+1)!}.$$
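Before looking for a proof, the conjectured $\frac{1}{n+1}$ identity can at least be checked computationally. This is a sketch (the helper `beta_integral` is my own name); the Beta integral is again evaluated by binomial expansion in exact rational arithmetic, so nothing circular is used:

```python
from fractions import Fraction
from math import comb

def beta_integral(a, b):
    """Exact value of ∫_0^1 t^a (1-t)^b dt via binomial expansion of (1-t)^b."""
    return sum(Fraction(comb(b, k) * (-1) ** k, a + k + 1) for k in range(b + 1))

# C(n,a) · Beta(a+1, n-a+1) should equal 1/(n+1) for every 0 <= a <= n.
for n in range(10):
    for a in range(n + 1):
        assert comb(n, a) * beta_integral(a, n - a) == Fraction(1, n + 1)
```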
For non-negative integers $a, b$ and $t \in [0, 1]$, the expression $t^a (1 - t)^b$ is the probability that, when we select $a+b$ real numbers independently and uniformly at random from $[0, 1]$, the first $a$ land in $[0, t]$ and the last $b$ land in $[t, 1]$. The integral $\int_0^{1} t^a (1 - t)^b \, dt$ is then the probability that, when we select $a+b+1$ such numbers and call the first one $t$, the next $a$ numbers land in $[0, t]$ and the last $b$ numbers land in $[t, 1]$.
It follows that ${a+b \choose b} \int_0^1 t^a (1 - t)^b \, dt$ is the probability that, among $a+b+1$ numbers selected this way, some $a$ of the last $a+b$ land in $[0, t]$ and the other $b$ land in $[t, 1]$, where $t$ is the first number. But this is exactly the event that the first number is the $(a+1)^{\text{st}}$ smallest of all $a+b+1$ numbers, and by symmetry the first number is equally likely to occupy any of the $a+b+1$ ranks, so this probability is $\frac{1}{a+b+1}$. Hence
$$\int_0^1 t^a (1 - t)^b dt= \frac{a! b!}{(a+b+1)!}$$
as desired. I learned this proof through an exercise in a Putnam training seminar; the multidimensional generalization also works.
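As a sanity check, the rank argument can be simulated directly and the resulting identity verified exactly. The sketch below is my own (the helper names, trial count, and seed are arbitrary choices): a Monte Carlo estimate of the probability that the first of $a+b+1$ i.i.d. uniforms is the $(a+1)^{\text{st}}$ smallest, plus an exact check of the integral formula via binomial expansion:

```python
import random
from fractions import Fraction
from math import comb, factorial

def rank_probability(a, b, trials=200_000, seed=0):
    """Estimate P(first of a+b+1 i.i.d. uniforms is the (a+1)-st smallest)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(a + b + 1)]
        if sorted(xs).index(xs[0]) == a:  # 0-indexed rank of the first draw
            hits += 1
    return hits / trials

def beta_integral(a, b):
    """Exact value of ∫_0^1 t^a (1-t)^b dt via binomial expansion of (1-t)^b."""
    return sum(Fraction(comb(b, k) * (-1) ** k, a + k + 1) for k in range(b + 1))

# The simulated rank probability should be close to 1/(a+b+1)...
print(rank_probability(2, 3))  # should be near 1/6 ≈ 0.1667

# ...and the integral matches a! b! / (a+b+1)! exactly.
for a in range(6):
    for b in range(6):
        assert beta_integral(a, b) == Fraction(factorial(a) * factorial(b),
                                               factorial(a + b + 1))
```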