Wikipedia says at most 5 axioms of power set, separation (of course a schema), extensionality, pairing, and infinity are needed, but to me it looks like we only need the first three.
Define $|A| \leq |B|$ as “there exists a bijection from $A$ onto a subclass of $B$”, which I'm pretty sure is equivalent to saying there's an injection. Define $|A| < |B|$ as $|A| \leq |B|$ and $|B| \nleq |A|$ (since we don't necessarily have $|{\leq}|$ antisymmetric).
The axiom of infinity isn't even needed, since the theorem has nothing to do with infinite sets specifically (that's the beautiful thing), and it would apply just as well to any smaller (Dedekind-finite) infinite set as to $\omega$, so singling out that specifically seems kind of arbitrary. Anyway, here's my argument:
(1) The converse partition principle (or maybe “converse axiom of choice”) states that: if $\exists a \in A$, for every injection $f:A \hookrightarrow B$ there is a left inverse $g:B \twoheadrightarrow A$ (proof: define $g(x) = \operatorname{Im}^{-1}_{f}(x)$ if it exists and $g(x) = a$ otherwise.)
Then we have the usual diagonalization: for any $S$, suppose there is a surjection $f:S \twoheadrightarrow \mathcal{P}(S)$. By separation the subset $D = \{s \in S \mid s \notin \operatorname{Im}_f(s)\}$ of $S$ exists, and $D \in \mathcal{P}(S)$, so there must be $s \in S$ so that $f(s) = D$. But $s \in D \iff s \notin \operatorname{Im}_f(s)$, and since both can't contain $s$, $\operatorname{Im}_f(s) \neq D$. No surjection can exist, so there is no injection from the nonempty $\mathcal{P}(S) \hookrightarrow S$.
(2) For any $S$, $S \subseteq S$, so $S \in \mathcal{P}(S)$. By applying separation to $\mathcal{P}(S)$, the singleton $\{S\}$ exists for every set $S$. Given any set $S$, then, there is an obvious injection into its power set, $f:S \hookrightarrow \mathcal{P}(S)$, $f:x \mapsto \{x\}$.
(1) and (2) suffice to show $|S| < |\mathcal{P}(S)|$. I also believe this only requires $\Delta_0$-separation, but I have trouble understanding how the Levy hierarchy for formulas works. Is my argument correct or did I make a leap somewhere (it feels “too convenient” so I feel like I might have)? Can this be made to work even without extensionality or reduced even further? Are there other choices of axioms very different from the ones I used but also very restricted that can prove it?
We can explicitly construct a model of {extensionality, pairing, separation, powerset, regularity} where Cantor's Theorem fails in the sense that there is an $X$ that does not inject into its power set. The construction proceeds in a sequence of disjoint generations:
First, the red generation $\mathbf R$, consists of all hereditarily finite sets of the ambient universe.
Next, the yellow generation consists of infinitely many sets $(Y_n)_{n\in\mathbb Z}$ such that $Y_n=\{Y_{n+1}\}$.
There are $\omega$ many green generations. The first of them, $\mathbf G_0$, consists of all sets of the form $\{\varnothing, Y_n\}$.
Subsequentely, $\mathbf G_{n+1}$ consists of all subsets of $\mathbf R\cup\mathbf G_n$ that contain at least one element from $\mathbf G_n$ . Note that elements from earlier green generations are not allowed, and that $\mathbf G_n$ is itself an element of $\mathbf G_{n+1}$.
Finally there are $\omega$ many blue generations which each consists of all the finite sets of elements from any earlier generations that have not themselves been added yet.
The union of all the generations is the claimed model.
It satisfies extensionality by construction.
It satisfies pairing, because every $x,y$ in the model will have their $\{x,y\}$ added in one of the blue generations if it didn't arise earlier.
It has all subsets of every set. For a finite set, all its subsets will also be finite, so they are added in one of the blue generations at the latest. The only generations that can contain infinite sets are the green ones, and every subset of a green set is either red or in the same green generation.
It has power sets of everything. Any finite set will have a finite power set, which will be added in one of the blue generations at latest. An infinite set from a green generation will have its power set in the next green generation.
It satisfies regularity: Every descending $\in$-chain will eventually end in a tail of $Y_n$s (because all other generations add sets with members from earlier generations only), but there is never any infinite set that contains an $Y_n$.
Finally, $\mathbf G_0$ is in the model, but the model contains no injective map $f : \mathbf G_0\to\mathcal P(\mathbf G_0)$. Such a map would need to map some $g\in\mathbf G_0$ to a nonempty $h\subseteq \mathbf G_0$. However $h$ is in $\mathbf G_1$, which means that none of the green generations is allowed to contain $\{g,h\}$. This means that $\{g, h\}$ and the Kuratowski pair $\langle g, X\rangle$ are both blue -- but this means that every set that contains $\langle g,h\rangle$ is finite and so cannot be $f$.
The basic idea here is that for any infinite set, all $\in$-chains from it to a $\mathbf G_0$ set have the same length. I have not found a way to make that idea work with an externally well-founded model -- but of course that doesn't prove that there can't be well-founded models that fail Cantor's theorem.
Having a axiom of binary unions (rather than the full axiom of union), plus pairing, power set, and $\Delta_0$ separation (with parameters) would be enough to construct $X\times \mathcal P(X)$ and thence its power set and the injection $x\in X\mapsto \{x\}$.