While reading up on the derivation of the maximum likelihood estimator for the population variance, I found that the MLE of the variance is the biased sample variance, $\frac{n-1}{n}S^2$, where $S^2$ is the unbiased sample variance. As satisfactorily discussed here, minimising the MSE can be a good criterion for selecting an estimator, as it accounts for both the variance and the bias of the estimator.
The variance estimator minimising the MSE would be $\frac{n-1}{n+1}S^2$, which differs from the MLE. Which estimator should then be chosen?
Why is there this discrepancy, if the MLE is supposed to reflect the best possible attempt at estimating the unknown quantity with the given data? (Does this have something to do with substituting the sample mean for the population mean while deriving the variance MLE, a step I fail to understand well?)
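A quick simulation illustrating the comparison (a minimal NumPy sketch; the sample size, variance, and seed are arbitrary choices): for small normal samples, dividing the sum of squared deviations by $n+1$ gives a smaller MSE than dividing by $n$ (the MLE) or by $n-1$ (the unbiased $S^2$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 5, 1.0, 200_000

# many small samples from N(0, sigma2); ss = sum of squared deviations
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

# compare the unbiased S^2, the MLE, and the MSE-minimising estimator
for name, divisor in [("S^2, /(n-1)", n - 1), ("MLE, /n", n), ("min-MSE, /(n+1)", n + 1)]:
    est = ss / divisor
    bias = est.mean() - sigma2
    mse = ((est - sigma2) ** 2).mean()
    print(f"{name:>15}: bias = {bias:+.4f}, MSE = {mse:.4f}")
```

The estimated MSEs come out ordered min-MSE $<$ MLE $<$ $S^2$, consistent with the theoretical values $2\sigma^4/(n+1)$, $(2n-1)\sigma^4/n^2$ and $2\sigma^4/(n-1)$.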
First of all, it is worth specifying that you are talking about the normal distribution. Otherwise, $\frac{n-1}{n}S^2$ is not (necessarily) the MLE of $\text{var}(X)$.
"if the MLE is supposed to reflect the best attempt..."
There is no universally best method for deriving estimators. ML maximisation is only one possible, though widely accepted, method. However, its justification is mainly based on the asymptotic ($n\to \infty$) properties of the estimators rather than on small-sample features like vanishing bias. On a slightly theoretical ground, what would you expect from a "good" estimator?
1) Consistency: $\hat{\tau}_n \xrightarrow{p} \tau$.

1.1) Asymptotic unbiasedness: $\lim_{n\to\infty} \mathbb{E}\hat{\tau}_n=\tau$.

2) Full use of the information available in the sample, in the sense of Fisher information, i.e., $\mathcal{I}_{\hat{\tau}_n}(\tau)=\mathcal{I}_{X_1,\dots,X_n}(\tau)$.
ML estimators satisfy these conditions; furthermore, under some regularity conditions (finite variance and independence of $\tau$ from the support of $X_1,\dots,X_n$), the MLE will converge in distribution to a normal r.v. with the minimal possible variance, the Cramér–Rao lower bound $\mathcal{I}^{-1}_{X_1,\dots,X_n}(\tau)$.
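To see the Cramér–Rao bound at work for this very example, here is a small simulation (a sketch; the replication count and parameter values are arbitrary) checking that the variance of the MLE $\hat{\sigma}^2=\frac{1}{n}\sum_i(X_i-\bar{X})^2$ approaches the bound $\mathcal{I}^{-1}(\sigma^2)=2\sigma^4/n$ as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, reps = 2.0, 5_000

for n in (20, 200, 2000):
    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    # MLE of the variance: divide the sum of squared deviations by n
    mle = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
    # Cramer-Rao lower bound for sigma^2 is 2 * sigma^4 / n
    print(f"n = {n:>4}: Var(MLE) / CRLB = {mle.var() / (2 * sigma2**2 / n):.3f}")
```

The exact ratio is $(n-1)/n$, so it tends to $1$; note it can sit slightly below $1$, which does not contradict the bound because the MLE is biased in finite samples.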
So, if it is so good, why do the aforementioned "discrepancies" occur? As you can see, some of the desired properties may hold only as $n\to \infty$. As such, if for some reason you are dealing with small $n$ and value unbiasedness, the ML estimator won't necessarily be your best choice. Another possible reason to reject the method is intractability of the estimator. Deriving the MLE for $\mathcal{N}(\mu, \sigma^2)$ is mathematically easy, but once your parameter space is of higher dimension and/or the likelihood function is not so smooth and "nice", the task of maximisation may become pretty troublesome.
Returning to the estimators of $\text{var}(X)$ in $\mathcal{N}(\mu, \sigma^2)$: all the presented estimators are asymptotically equivalent in terms of bias and efficiency, as $n\pm 1 \approx n$ for large enough $n$. Thus, for very large samples it doesn't matter which one you choose. For small samples, you may care about bias or about efficiency (in terms of MSE), so it is reasonable to choose one of the modified estimators accordingly.
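This asymptotic equivalence can be made concrete with the closed forms under normality (a sketch using $\mathbb{E}[SS]=(n-1)\sigma^2$ and $\text{Var}(SS)=2(n-1)\sigma^4$, where $SS=\sum_i(X_i-\bar{X})^2$ and the three estimators are $SS/d$ with $d = n-1, n, n+1$):

```python
# exact bias and MSE of SS/d under N(mu, sigma^2), where
# SS has mean (n-1)*sigma^2 and variance 2*(n-1)*sigma^4
def bias_and_mse(n, d, sigma2=1.0):
    bias = (n - 1) / d * sigma2 - sigma2
    var = 2 * (n - 1) / d**2 * sigma2**2
    return bias, bias**2 + var

for n in (5, 50, 500):
    for name, d in [("S^2", n - 1), ("MLE", n), ("min-MSE", n + 1)]:
        bias, mse = bias_and_mse(n, d)
        print(f"n = {n:>3} {name:>8}: bias = {bias:+.4f}, MSE = {mse:.4f}")
```

For each $n$ the MSEs are ordered min-MSE $<$ MLE $<$ $S^2$, but the gaps shrink rapidly: at $n=500$ the three estimators are practically indistinguishable, while at $n=5$ the choice of divisor visibly matters.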