Suppose I observe the outcomes ($x_1$ through $x_n$) of $n$ rolls ($X_1$ through $X_n$) of a fair, $\theta$-sided die, and want to find a point estimate of $\theta$ using Maximum Likelihood Estimation. I will state the problem without referring to the likelihood function, in order to eliminate one layer of confusion.
In the mindset of modeling the problem, it seems to me that the goal is to choose the value of $\theta$ that has the greatest probability of being the actual value, given the information generated by the rolls, and that this goal in mathematical notation is $$\max_\theta \Pr\left(\theta \;\middle|\; \bigcap_{i = 1}^n X_i = x_i\right).$$ However, in the course of solving the problem, the formula used is $$\max_\theta \Pr\left(\bigcap_{i = 1}^n X_i = x_i \;\middle|\; \theta\right).$$
Is the point of the maximum likelihood principle that both of these maximizations produce the same answer? If not, why is the latter the correct one to solve, even though the former seems to better capture the English description of the problem?
The first equation is problematic in the frequentist viewpoint. To illustrate, consider the familiar model $$X_i \mid \theta \sim \operatorname{Bernoulli}(\theta), \qquad \Pr[X_i = 1 \mid \theta] = \theta, \quad \Pr[X_i = 0 \mid \theta] = 1 - \theta.$$ Given a sample $(x_1, \ldots, x_n)$, the joint probability is $$\prod_{i=1}^n \Pr[X_i = x_i \mid \theta] = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}, \tag{1}$$ where $\sum x_i$ is the sample total, or equivalently, the number of observations equal to $1$.
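For concreteness, maximizing $(1)$ as a function of $\theta$ (equivalently, maximizing its logarithm) recovers the familiar estimate. Setting the derivative of the log of $(1)$ to zero, $$\frac{d}{d\theta}\left[\left(\sum x_i\right)\log\theta + \left(n - \sum x_i\right)\log(1-\theta)\right] = \frac{\sum x_i}{\theta} - \frac{n - \sum x_i}{1-\theta} = 0 \quad\Longrightarrow\quad \hat\theta = \frac{1}{n}\sum_{i=1}^n x_i,$$ the sample proportion of ones (with the boundary cases $\sum x_i \in \{0, n\}$ checked directly).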
But the problem with $(1)$ is that it is not a probability density over the parameter space $\theta \in [0,1]$. It is proportional to a density, but $$\int_{\theta=0}^1 \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i} \, d\theta \ne 1$$ for general $n$ and $\sum x_i$. This is why a statement like $$\Pr\left[\theta \;\middle|\; \bigcap_{i=1}^n X_i = x_i \right]$$ is, strictly speaking, not correct unless we are talking about a Bayesian posterior for $\theta$, in which case a prior distribution for $\theta$ must be specified. Moreover, even under a Bayesian interpretation, $\Pr[\theta \mid \cdot]$ is problematic when $\theta$ is a parameter with continuous support; we would instead need to write a density $f(\theta \mid \cdot)$.
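Concretely, writing $s = \sum_{i=1}^n x_i$, the integral above evaluates to a Beta function value: $$\int_0^1 \theta^{s} (1 - \theta)^{n - s} \, d\theta = B(s+1,\, n-s+1) = \frac{s!\,(n-s)!}{(n+1)!},$$ which equals $1$ only in the degenerate case $n = 0$. For example, $n = 2$, $s = 1$ gives $\int_0^1 \theta(1-\theta)\,d\theta = 1/6$.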
In the context of maximum likelihood estimation, it is simply better to dispense with all of this and describe the quantity to be maximized as $\mathcal L(\theta \mid \cdot)$, a likelihood function of $\theta$ with respect to some sample or observed outcome. The fact that the likelihood is proportional to the joint density, i.e. $$\mathcal L(\theta \mid x_1, \ldots, x_n) \propto f_{X_1, \ldots, X_n}(x_1, \ldots, x_n \mid \theta), \tag{2}$$ is, for all intents and purposes, a definition. Likelihoods are not unique; they are defined only up to a (positive) constant of proportionality. It is because of $(2)$ that we are able to use the joint density to maximize the likelihood: knowing the density as a function of the parameter $\theta$ lets us choose the value of $\theta$ "most likely" to have generated the sample. Note also that there are nontrivial cases in which "most likely" is not unique; i.e., a sample can have more than one MLE.
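As a quick numerical illustration of both points (only the argmax matters, and it is invariant to the constant of proportionality), here is a minimal sketch for the Bernoulli model above; the sample values are made up for demonstration:

```python
import numpy as np

# Made-up 0/1 sample for illustration: s = 7 ones out of n = 10 draws.
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n, s = len(x), x.sum()

# Log of (1): log L(theta | x) = s*log(theta) + (n - s)*log(1 - theta),
# evaluated on a grid of candidate values of theta in (0, 1).
theta = np.linspace(0.001, 0.999, 999)
log_lik = s * np.log(theta) + (n - s) * np.log(1 - theta)

# The maximizer on the grid matches the closed-form MLE, the sample mean.
print(theta[np.argmax(log_lik)])  # ~0.7
print(s / n)                      # 0.7

# Multiplying the likelihood by a positive constant (adding a constant on
# the log scale) does not move the argmax: likelihoods equivalent up to
# proportionality yield the same MLE.
print(theta[np.argmax(log_lik + np.log(42.0))])  # ~0.7
```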