I suppose I'll give a little context to this...
I was at first really excited by the prospect that Bernoulli's and Taylor's versions of $e^x$ actually amount to the same thing (when you expand Bernoulli's definition $\lim_{n\to\infty}(1+\frac{x}{n})^n$, it simplifies to the Taylor series $\sum_{n=0}^{\infty}\frac{x^n}{n!}$ - unlike, say, trigonometric functions such as $\sin(x)$, which look a bit more unsatisfying). However, upon further reflection, saying that one identity 'simplifies' to the other seems almost circular, given that it presupposes the binomial theorem.
So, I decided to do a little scouting online, and found that the binomial theorem can be proven by induction. But once again, I felt unsatisfied, as it feels as though you've just conjured up some formula that happens to work. How did you come up with it in the first place?
Well, that's when I thought of probability theory. We use the binomial expression $^nC_{r}$ when we are counting the number of ways of arranging things. We can derive this formula very intuitively; however, as I don't want this question to be unnecessarily long, I'm not going to go into detail about this here. What counts is that we can figure out this formula from first principles - we're not getting it from nowhere.
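(Not part of the argument, but as a quick sanity check in Python: the usual factorial formula for $^nC_r$ agrees with the standard library's `math.comb`.)

```python
from math import comb, factorial

# Check that nCr = n! / (r! (n - r)!) matches math.comb for a sample case.
n, r = 10, 4
by_factorials = factorial(n) // (factorial(r) * factorial(n - r))
print(comb(n, r), by_factorials)  # 210 210
```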
As I'm sure you already know, $^nC_{r}$ is key to figuring out binomial distributions. Once again I'm not going to explain this, but an example probability calculation becomes $^{10}C_{4}(\frac{1}{6})^4(\frac{5}{6})^{10-4}$ for rolling exactly 4 sixes in 10 total rolls.
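(A quick numerical evaluation of that dice example, just as an illustration:)

```python
from math import comb

# Probability of exactly 4 sixes in 10 rolls of a fair die:
# 10C4 * (1/6)^4 * (5/6)^(10-4)
p = comb(10, 4) * (1 / 6) ** 4 * (5 / 6) ** 6
print(round(p, 4))  # 0.0543
```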
According to (Kolmogorov's) axioms of probability, the probability of the set of all possible outcomes, $P(\Omega)$, is 1. Let's consider the total binomial outcomes of 3 trials (I pick this number for simple, illustrative purposes), where we designate the probability of success as $\frac{1}{5}$ and that of failure as $\frac{4}{5}$ (so that the probabilities of the outcomes of a single trial, here success and failure, sum to 1). Because we have 3 trials, logically, we'll have 4 possibilities: 0 successes, 1 success, 2 successes, or 3 successes; we can write out our total probability as: $^{3}C_{0}(\frac{1}{5})^0(\frac{4}{5})^3$ + $^{3}C_{1}(\frac{1}{5})^1(\frac{4}{5})^2$ + $^{3}C_{2}(\frac{1}{5})^2(\frac{4}{5})^1$ + $^{3}C_{3}(\frac{1}{5})^3(\frac{4}{5})^0$ = $(\frac{4}{5})^3$ + $3(\frac{1}{5})(\frac{4}{5})^2$ + $3(\frac{1}{5})^2(\frac{4}{5})$ + $(\frac{1}{5})^3$ $= 1$.
This can of course be generalised to $\sum_{k=0}^{n}{}^{n}C_{k}\,{p}^k{q}^{n-k} = 1$, where $n$ is the total number of trials, $k$ the number of successes, $p$ the probability of success, and $q$ that of failure.
Now let's consider the binomial expansion of $(a + b)^3$. Supposing we know nothing about the binomial theorem and expand it by hand, our algebra nonetheless simplifies to $a^3 + 3a^2b + 3ab^2 + b^3$. Hmmm... this looks suspiciously like our total probability formula expanded above. It almost looks as though $a$ could be representing our successes, and $b$ our failures.
So my reasoning is this: it just so happens that in the context of probability, the probabilities of all your outcomes add up to 1. If $\Omega = \left\{ a, b \right\}$, then, by Kolmogorov's axioms, $a + b = 1$. What we're effectively saying is that any binomial expansion can be treated as an expansion of 'probability outcomes', except that we allow $a + b \ne 1$.
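(And indeed, numerically, the expansion formula matches $(a+b)^n$ for values where $a + b \ne 1$ - a small check, just for illustration:)

```python
from math import comb

# The binomial-theorem sum should equal (a + b)^n for arbitrary a, b,
# not only when a + b = 1.
def binomial_expansion(a, b, n):
    return sum(comb(n, r) * a**r * b ** (n - r) for r in range(n + 1))

a, b, n = 2.5, -0.7, 6  # arbitrary values with a + b != 1
print(binomial_expansion(a, b, n), (a + b) ** n)
```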
Is this a reasonable proof?
Here is one way to understand the binomial theorem that doesn't use induction. I'll explain the links between this and probability at the end of this post.

First, because multiplication is both distributive over addition and commutative, we know that $$ a(x+y)=(x+y)a=ax+ay \, . $$ When expanding a double bracket, the same principle applies. To expand $(a+b)(c+d)$, we can let $k=c+d$, and so $$ (a+b)k=ak+bk=a(c+d)+b(c+d)=ac+ad+bc+bd \, . $$ This justifies the 'FOIL' method for expanding a double bracket: $$ (a+b)(c+d)=\underbrace{ac}_{\text{First}}+\underbrace{ad}_{\text{Outside}}+\underbrace{bc}_{\text{Inside}}+\underbrace{bd}_{\text{Last}} \, . $$

At this point, we spot something interesting. The expansion of $(a+b)(c+d)$ is simply the sum of all the different ways of selecting a single element from each bracket. This pattern repeats for three brackets: \begin{align} (a+b)(c+d)(e+f) &= (ac+ad+bc+bd)(e+f) \\[4pt] &= e(ac+ad+bc+bd)+f(ac+ad+bc+bd) \\[4pt] &=ace+ade+bce+bde+acf+adf+bcf+bdf \, . \end{align}

Consider how, every time we add on a new bracket, the number of terms in the expansion doubles. When we multiplied $(a+b)(c+d)$ by $(e+f)$, the resulting expansion was simply the previous one with $e$ tacked on at the end of each term, followed by the previous one with $f$ tacked on at the end of each term. This means that it is still the case that the expansion represents the sum of all of the different ways of selecting a single element from each bracket.

If you're still not convinced, then notice how the expansion of $(a+b)(c+d)$ is $ac+ad+bc+bd$. These four terms represent all of the different ways you can select a single element from the brackets $(a+b)$ and $(c+d)$. There are eight ways of picking a single element from the brackets $(a+b)$, $(c+d)$, and $(e+f)$. And we can find these eight ways by drawing a tree diagram that 'branches out' three times. When we consider the first bracket, there are two combinations: $a$ and $b$.
When we consider the second bracket, there are four combinations: $ac$, $ad$, $bc$, and $bd$. When we consider the third bracket, there are eight combinations: $ace$, $acf$, $ade$, $adf$, $bce$, $bcf$, $bde$, and $bdf$. The eight combinations come from making a choice about whether to tack an $e$ or an $f$ onto the end of each term, which is exactly what happens when we multiply $(ac+ad+bc+bd)$ by $(e+f)$. Once you accept this, it is not too difficult to understand, at least intuitively, that the expansion of several brackets always results in the sum of all of the different possible combinations.
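If you'd like to see this selection principle mechanically, here is a small Python sketch that enumerates every way of picking one element from each of the three brackets; the expansion is just the sum of these eight products:

```python
from itertools import product

# Each inner list is one bracket; picking one symbol from each bracket
# gives one term of the expansion of (a+b)(c+d)(e+f).
brackets = [["a", "b"], ["c", "d"], ["e", "f"]]
terms = ["".join(choice) for choice in product(*brackets)]
print(terms)   # ['ace', 'acf', 'ade', 'adf', 'bce', 'bcf', 'bde', 'bdf']
print(len(terms) == 2 ** len(brackets))  # True: terms double per bracket
```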
The expansion of $(a+b)^n$ is then the result of adding up all of the different ways you can select a single element from each bracket. The difference here is that many of the terms are repeated. $$ (a+b)^n = \underbrace{(a+b)(a+b)(a+b)\dots(a+b)}_{\text{$n$ times}} \, . $$

Let's go through the expansion step by step. If we pick $a$ zero times and $b$ $n$ times, then the term we end up with is $b^n$. If we pick $a$ $1$ time and $b$ $(n-1)$ times, then the term we end up with is $ab^{n-1}$. However, there are $n$ different ways of picking $a$ $1$ time and $b$ $(n-1)$ times. Hence, the term $ab^{n-1}$ is repeated $n$ times over, meaning that we end up with $nab^{n-1}$. If we pick $a$ $2$ times and $b$ $(n-2)$ times, then the term we get is $a^2b^{n-2}$, and the coefficient is $^nC_2$. (By definition, $^nC_r$ is the number of ways you can select $r$ elements out of a group of $n$.) The pattern continues. In general, if we pick $a$ $r$ times and $b$ $(n-r)$ times, then the term that we get is $a^rb^{n-r}$, and the coefficient is $^nC_r$. This gives us the binomial theorem: $$ (a+b)^n = \sum_{r=0}^{n}{n \choose r}a^rb^{n-r} \, . $$

How does this link to probability? Imagine you're flipping a fair coin ten times. Each individual sequence is as likely as any other. It is just as likely for you to get $\mathrm{HHHHHHHHHH}$ as it is $\mathrm{HHTHHHHTTH}$. However, it is much more likely for you to get $5$ heads as opposed to $0$ heads. Why? Because there are many ways in which a group of $10$ tosses can result in $5$ heads, compared with only one way to get $0$ heads. To be precise, there are $^{10}C_5=252$ ways of getting $5$ heads, but only $1$ way of getting $0$ heads.
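As a brute-force illustration of that counting argument, here is a short Python sketch that enumerates all $2^{10}$ coin-toss sequences and counts them directly:

```python
from itertools import product
from math import comb

# Enumerate every sequence of 10 tosses and count those with exactly
# 5 heads versus exactly 0 heads; compare with 10C5.
sequences = list(product("HT", repeat=10))
five_heads = sum(1 for s in sequences if s.count("H") == 5)
zero_heads = sum(1 for s in sequences if s.count("H") == 0)
print(five_heads, zero_heads, comb(10, 5))  # 252 1 252
```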