Wikipedia gives the following statement of the theorem:
If a function $x(t)$ contains no frequencies higher than B hertz, it is completely determined by giving its ordinates at a series of points spaced $\frac 1 {2B}$ seconds apart.
What is the precise mathematical version of the theorem? I.e. something like: Let $x:\mathbb R\to \mathbb R$ be a twice differentiable function. Then if the Fourier transform .... then there is a ....
The classical formulation of the Shannon sampling theorem, which, if I recall correctly, actually predates Shannon, is as follows:
Theorem (Shannon): Let $f\in L^1(\mathbb{R})$ have 'band limit' $B$, i.e. $\operatorname{supp}\hat{f}\subseteq[-B,B]$. Then $$f(x) = \sum_{k\in\mathbb{Z}} f\left(\frac{k}{2B}\right)\operatorname{sinc}\left(2Bx-k\right) $$ holds in the $L^2$ sense. Here $\operatorname{sinc}(x)$ refers to the 'normalized' version, i.e. $\frac{\sin\pi x}{\pi x}$, rather than $\frac{\sin x}{x}$. This result follows by expanding $\hat{f}$ as a Fourier series over $L^2[-B,B]$, recovering the Fourier coefficients, and applying Fourier inversion. The right-hand side is known as the Whittaker–Shannon interpolation formula, and gives an explicit reconstruction of the 'signal'. Notice that $\operatorname{sinc}(x)$ corresponds to $\operatorname{rect}(\xi)$, a 'rectangular pulse', under the Fourier transform.
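As a quick sanity check, here is a minimal numerical sketch of the Whittaker–Shannon formula (a toy setup of my own: a hypothetical 3 Hz cosine with band limit $B = 4$, and the series truncated to finitely many samples; `np.sinc` is exactly the normalized sinc above):

```python
import numpy as np

def shannon_reconstruct(samples, B, t):
    """Truncated Whittaker-Shannon interpolation from samples f(k/(2B)).

    samples holds f(k/(2B)) for k = -K..K; np.sinc is the normalized
    sinc sin(pi x)/(pi x), matching the theorem's convention.
    """
    K = len(samples) // 2
    k = np.arange(-K, K + 1)
    # f(t) ~ sum_k f(k/(2B)) sinc(2B t - k), truncated to |k| <= K
    return sum(s * np.sinc(2 * B * t - kk) for s, kk in zip(samples, k))

B = 4.0                                      # band limit in Hz
f = lambda t: np.cos(2 * np.pi * 3.0 * t)    # 3 Hz < B, so band-limited
k = np.arange(-200, 201)
samples = f(k / (2 * B))                     # samples spaced 1/(2B) apart
t = np.linspace(-1.0, 1.0, 9)
err = np.max(np.abs(shannon_reconstruct(samples, B, t) - f(t)))
```

The residual `err` comes purely from truncating the series, since the exact formula sums over all $k\in\mathbb{Z}$.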
Now we can also explain oversampling, undersampling, and anti-aliasing in mathematical language.
Proposition (anti-aliasing): Let $f,g\in L^1(\mathbb{R})$ and $B>0$, and assume that $\hat{g} = \hat{f}\,\chi_{[-B,B]}$. Then $g$ minimizes the $L^2$ distance $\int_\mathbb{R} \lvert f-h\rvert^2$ among all functions $h$ with band limit $B$. (This is immediate from Plancherel: the distance equals $\int_\mathbb{R} \lvert\hat f-\hat h\rvert^2$, the contribution from outside $[-B,B]$ is the same for every band-limited $h$, and $g$ makes the contribution inside $[-B,B]$ vanish.)
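Here is a small discrete analogue of the proposition (a sketch of my own, using FFT bins as a stand-in for continuous frequencies): by Parseval, the $L^2$ distance between signals equals the $\ell^2$ distance between their spectra, so we can compare the ideal low-pass $g$ against any other band-limited competitor directly in the frequency domain.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
B = 20                                  # band limit, in FFT-bin units
freqs = np.fft.fftfreq(N, d=1.0 / N)    # integer bin "frequencies"
in_band = np.abs(freqs) <= B

# Work directly with spectra; by Parseval the L2 distance of signals
# equals the l2 distance of their spectra (up to a constant factor).
f_hat = rng.normal(size=N) + 1j * rng.normal(size=N)   # spectrum of f
g_hat = np.where(in_band, f_hat, 0)                    # ideal low-pass of f

# Any competing band-limited h: perturb g somewhere inside the band.
h_hat = g_hat + np.where(in_band, rng.normal(size=N), 0)

e_g = np.sum(np.abs(f_hat - g_hat) ** 2)   # error of the anti-aliased g
e_h = np.sum(np.abs(f_hat - h_hat) ** 2)   # error of the competitor h
```

Since $\hat f - \hat g$ lives entirely outside the band and the perturbation lives entirely inside it, `e_h` exceeds `e_g` by exactly the perturbation's energy.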
It's clear from Shannon's theorem that the formula only needs $B$ to be an *upper bound* on the band limit of $f$: even if we sample faster than the Nyquist rate, the Whittaker–Shannon formula still holds, so oversampling does not hurt reconstruction of signals.

For an interesting tie-in to complex analysis, there is the Paley–Wiener theorem, as follows. Define the Paley–Wiener space (modulo some factors of $2\pi$ which I may have forgotten) as $$PW_R := \{ u :\mathbb{C}\to\mathbb{C}\ \text{entire}\,:\,\lvert u(z)\rvert \leq C \exp(R|z|)\}$$ and suppose $u$ is entire with $f:=u|_\mathbb{R} \in L^2(\mathbb{R})$. Then $u\in PW_R$ iff $f$ has band limit $R$.

But wait, it gets better. Define the Bernstein space $$B_R := \left\{u\ \text{entire}\,:\, \sup_{t\in\mathbb{R}}e^{-R|t|}\int_{\mathbb{R}} |u(x+it)|^2\,dx<\infty\right\}.$$ By the Phragmén–Lindelöf principle, these are actually the same space! Amazing, isn't it? But you might ask: what practical use are Paley–Wiener spaces? Well, the Paley–Wiener–Levinson theorem generalizes the above ideas to non-uniform sampling.
Theorem (Levinson): Suppose $\{t_n\}_{n\in\mathbb{Z}}$ is a sequence of sample points such that $\sup_{n\in\mathbb{Z}} |t_n-n|<\frac{1}{4}$, and $f\in B_{\pi}$. Then $$f(z) = \sum_{n\in\mathbb{Z}}f(t_n) \frac{G(z)}{G'(t_n)(z-t_n)} $$ where $G(z) = (z-t_0)\prod_{n\geq 1}\left(1-\frac{z}{t_n}\right)\left(1-\frac{z}{t_{-n}}\right)$, with locally uniform convergence. Note that $G(z)$ is essentially the infinite analogue of a Lagrange interpolation polynomial. Furthermore, if the spacing is uniform with $t_n=n$ (so $t_0=0$), then $G(z)$ is a Weierstrass product for sine, and we recover the usual Whittaker–Shannon formula.
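To see the last remark concretely, here is a small numerical check (a sketch; `N` truncates the infinite product) that for the uniform nodes $t_n = n$ the product $G(z)$ reproduces the Weierstrass product $\sin(\pi z)/\pi$:

```python
import numpy as np

def G(z, N=20000):
    """Truncated Levinson product for uniform nodes t_n = n, t_0 = 0:
    G(z) = z * prod_{n=1}^{N} (1 - z/n)(1 + z/n) = z * prod (1 - z^2/n^2)."""
    n = np.arange(1, N + 1)
    return z * np.prod(1.0 - (z * z) / (n * n))

z = 0.3
# Weierstrass product for sine: G(z) -> sin(pi z)/pi as N -> infinity
exact = np.sin(np.pi * z) / np.pi
err = abs(G(z) - exact)
```

The tail of the product contributes a relative error of roughly $z^2/N$, so the truncation at $N = 20000$ already agrees with $\sin(\pi z)/\pi$ to several digits.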
Now it turns out you can (sometimes) beat the Shannon theorem via probabilistic methods (cf. this paper by Emmanuel Candès, Justin Romberg and Terence Tao, and this one by David Donoho, for an introduction to compressed sensing). Results like this have been used to develop image-processing techniques; this area is closely related to convex optimization, linear and non-linear programming, etc.
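To make the compressed-sensing idea concrete, here is a toy basis-pursuit sketch (my own illustrative setup, not taken from those papers: random Gaussian measurements of a 3-sparse vector, with the $\ell_1$ minimization $\min \lVert x\rVert_1$ s.t. $Ax=b$ cast as a linear program via `scipy.optimize.linprog` by splitting $x = u - v$ with $u, v \geq 0$):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m, k = 60, 25, 3          # ambient dim, measurements, sparsity (toy sizes)

# A k-sparse ground-truth vector and far fewer measurements than unknowns.
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.normal(size=k)

A = rng.normal(size=(m, n)) / np.sqrt(m)   # random Gaussian measurement matrix
b = A @ x_true

# Basis pursuit: min ||x||_1  s.t.  A x = b, as an LP in (u, v) with x = u - v.
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x_rec = res.x[:n] - res.x[n:]

err = np.max(np.abs(x_rec - x_true))
```

With these toy sizes ($m = 25$ measurements for $n = 60$ unknowns), $\ell_1$ minimization typically recovers the sparse vector exactly from far fewer samples than Shannon would require, which is the phenomenon the cited papers analyze.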