I am a bit confused about how the MLE (maximum likelihood estimation) formulation works.
Using Bayes theorem, we can see that
$$\mathbb P(\theta|X)=\mathbb P(\theta)\mathbb P(X|\theta)/\mathbb P(X)$$
$\mathbb P(X)$ does not depend on the parameters $\theta$, so we can simply maximize $\mathbb P(\theta)\mathbb P(X|\theta)$. Assuming a uniform prior, we are then simply maximizing $\mathbb P(X|\theta)$.
Now, here is where my confusion starts. For the simple case, let's assume that $X=\{x_1\}$ and that $\theta=(\mu,\sigma)$ parameterises a normal distribution. Then $\mathbb P(X|\theta)$ is just $\mathbb P(z\sigma+\mu=x_1)$, where $z$ is a standard normal variable. Given that the normal distribution is continuous, the probability of any single point is exactly zero.
Now, common sense tells us that (holding the variance equal) a normal distribution with parameter $\mu$ close to $x_1$ is a better fit than a normal distribution with a parameter $\mu$ far from $x_1$.
However, $\mathbb P(z\sigma+\mu=x_1)=0$ for all $\mu$ and $\sigma$, so how is this reconciled mathematically?
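To make the question concrete, here is a small numerical sketch (the values of $x_1$, $\sigma$, and the candidate $\mu$'s are arbitrary choices of mine): the exact-point probability is zero, yet the density clearly prefers a $\mu$ close to $x_1$ over one far from it.

```python
# The exact-point probability P(X = x1) is zero for every mu, yet the
# density at x1 discriminates between values of mu (values are made up).
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

x1, sigma = 2.0, 1.0

# The density at x1 is positive, and larger when mu is near x1 ...
near = normal_pdf(x1, mu=1.9, sigma=sigma)
far = normal_pdf(x1, mu=-3.0, sigma=sigma)
print(near > far)  # True

# ... even though P(X = x1) itself is zero for every choice of mu.
```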
With a discrete prior probability mass function for $\theta$ of $\mathbb P_0(\theta)$ and a discrete conditional probability for $X$ given $\theta$, you have (much as you stated) a posterior $$\mathbb P(\theta \mid X=x) \propto \mathbb P_0(\theta) \mathbb P( X=x \mid \theta)$$ and you can divide by $\sum_{\theta'} \mathbb P_0(\theta') \mathbb P( X=x \mid \theta')$ to turn this into an actual probability.
With a prior density for $\theta$ of $\pi_0(\theta)$ and a discrete conditional probability for $X$ given $\theta$, you have a posterior density $$\pi(\theta \mid X=x) \propto \pi_0(\theta) \mathbb P( X=x \mid \theta)$$ and you can divide by $\int_{\theta'} \pi_0(\theta') \mathbb P( X=x \mid \theta') d\theta'$ to turn this into a probability density.
With a prior density for $\theta$ of $\pi_0(\theta)$ and a conditional probability density for $X$ given $\theta$, you have a posterior density $$\pi(\theta \mid X=x) \propto \pi_0(\theta) f(x \mid \theta)$$ and you can divide by $\int_{\theta'} \pi_0(\theta') f(x \mid \theta') d\theta'$ to turn this into a probability density.
In the third case, you can find the $\theta$ which maximises $f(x \mid \theta)$ to give the maximum likelihood estimate of $\theta$. A Bayesian could instead find the $\theta$ which maximises $\pi(\theta \mid X=x)$, the mode of the posterior distribution (the so-called maximum a posteriori, or MAP, estimate), though, if forced to give a point estimate, they might prefer to take a loss function into account and report the value which minimises the expected loss under the whole posterior distribution.
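As a sketch of the maximum likelihood idea in your one-observation normal example (with $\sigma$ held fixed and $x_1$ a made-up value), the quantity being maximised is the density $f(x_1 \mid \mu, \sigma)$, not a point probability, and it is maximised exactly at $\mu = x_1$:

```python
# Maximum likelihood for mu via the density f(x1 | mu, sigma), with
# sigma fixed; x1 and the search grid are illustrative values.
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

x1, sigma = 2.0, 1.0

# Grid search over candidate mu values; the density is largest at mu = x1.
grid = [i / 100 for i in range(-500, 501)]
mu_hat = max(grid, key=lambda mu: normal_pdf(x1, mu, sigma))
print(mu_hat)  # 2.0
```

This matches the closed-form result: with a single observation, the likelihood $f(x_1 \mid \mu, \sigma)$ is maximised in $\mu$ at $\hat\mu = x_1$.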
Either way, shifting the analysis from probability mass functions for discrete distributions to probability density functions for continuous distributions should remove your issue about zero probabilities, both for the maximum likelihood estimator and for the mode of the posterior distribution.
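The third case can also be sketched numerically. Assuming (purely for illustration) a $N(0,1)$ prior density on $\mu$ with $\sigma$ fixed, the normalising integral can be approximated on a grid and the posterior mode read off:

```python
# Sketch of the third case: prior density times likelihood density,
# normalised on a grid to give a posterior density for theta = mu.
# The prior N(0, 1) and the value x1 = 2 are made-up assumptions.
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

x1, sigma = 2.0, 1.0
prior_mean, prior_sd = 0.0, 1.0

grid = [i / 100 for i in range(-500, 501)]
step = 0.01
unnorm = [normal_pdf(mu, prior_mean, prior_sd) * normal_pdf(x1, mu, sigma)
          for mu in grid]
Z = sum(u * step for u in unnorm)          # approximates the normalising integral
posterior = [u / Z for u in unnorm]

# The normalised posterior density integrates to (approximately) one ...
print(round(sum(p * step for p in posterior), 6))  # 1.0

# ... and its mode (the MAP estimate) sits between the prior mean and x1,
# at mu = 1 for this conjugate normal-normal setup.
map_mu = grid[posterior.index(max(posterior))]
print(map_mu)  # 1.0
```

With this conjugate prior the exact posterior is $N(1, 1/2)$, so the grid mode $\hat\mu_{MAP}=1$ agrees with the closed form.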