The problem:
For $n,M, d, q \in \mathbb{Z}$ and $q>1$, with $1\leq d\leq n$, show that if $$\left(M-1\right)\sum^{d-1}_{i=0} {n\choose i}\left(q-1\right)^{i} < q^n$$ holds, then there must exist a $q$-ary $\left[n, M\right]$ code of minimum distance $d.$
(Hint: This is often known as the Gilbert-Varshamov bound for non-linear codes).
My work so far:
From the proof of the Gilbert-Varshamov bound for linear codes, I know that in the linear case if the inequality $$\sum^{d-2}_{i=0} {n-1\choose i}\left(q-1\right)^{i} < q^{n-k}$$ holds, this implies that there exists an $[n,k]$-linear code over $\text{F}_{q}$ through the construction of an $\left(n-k\right)\times n$ matrix with $d-1$ columns being linearly independent. I also can show that the null space of this $\left(n-k\right) \times n$ matrix, let's call it $H$, will have a null space that is a linear code over $\text{F}_{q}$ of length $n$, dimension $k$, and distance $d$.
Should I approach this problem in a similar way? The $\left(M-1\right)$ term is what is really throwing me off.
You are using $[n,\alpha]$ as notation to mean both a linear code (that is, an $\alpha$-dimensional subspace of $\mathbb F_q^n$) and a nonlinear code (arbitrary subset of $\mathbb F_q^n$) containing $\alpha$ codewords. The notation $(n,\alpha)$ is often used to distinguish the latter from the former; that is, a $(n, q^k)$ code that happens to be a linear code would be denoted as a $[n,k]$ code, but if there is a $(n,q^k)$ code that is not linear (e.g. a coset of a $[n,k]$ linear code), many people would not use "$[n,k]$ code" to denote such a nonlinear code.
Settle on which notation you mean, and then count the number of vectors at distance $d-1$ or less from a purported codeword. This is called the Hamming sphere of radius $d-1$ centered at the codeword. If there is an $(n,M-1,d)$ code (whether linear or not is not relevant here), then the Hamming spheres of radius $d-1$ centered at the $(M-1)$ codewords are all disjoint, and if the total number of vectors in these spheres is less than $q^n$, there is at least one vector in $\mathbb F_q^n$ that is at distance $d$ or more from all the $M-1$ codewords, and this vector can be included in the $(n,M-1,d)$ code to create an $(N,M,d)$ supercode.