Consider $X_1, \dots, X_n \stackrel{\text{iid}}{\sim} \text{Exp}(\lambda)$ and define \begin{equation*} Y_i = \begin{cases} X_i & X_i \leq c \\ c & X_i > c \end{cases} \end{equation*} for some $c > 0$.
I am trying to derive the likelihood $L(y_1, \dots, y_n; \lambda)$ in terms of $K = |\{i : y_i = c\}|$, but cannot seem to get there.
What I did so far is derive $p = P(Y_i = c) = P(X_i > c) = e^{-\lambda c}$. I don't really know how to proceed from here, since the distribution of $Y_i$ is neither continuous nor discrete. I could write down something along the lines of
$$ L(y_1, \dots, y_n; \lambda) = p^k \prod_{i : y_i < c} f_\lambda(y_i) = e^{-\lambda k c} \lambda^{n-k} e^{-\lambda \sum_{i:y_i < c} y_i}. $$
However, I don't know how to formally argue that I can just multiply the densities and probabilities together, or why exactly I can rely on the density of an exponential distribution in the second factor. Any help on how to solve this step by step is welcome!
Your intuition is correct: because the sample is IID, we may write (relabeling the observations, without loss of generality, so that the first $K$ are the censored ones) $$\mathcal L(\lambda \mid \boldsymbol y) \propto \prod_{i=1}^K \Pr[X_i > c] \prod_{i=K+1}^n f_{X_i}(x_i) = e^{-K \lambda c} \lambda^{n-K} e^{-(n-K) \lambda \bar x'}, \tag{1}$$ where $$\bar x' = \frac{1}{n-K} \sum_{i=1}^n x_i \mathbb 1(x_i \le c)$$ is the mean of the observations that were not censored. We may also write this in terms of the $y_i$ as $$\bar x' = \frac{n \bar y - K c}{n-K}, \tag{2}$$ where $\bar y$ is the censored sample mean. Thus the likelihood simplifies to $$\mathcal L(\lambda \mid \boldsymbol y) = \lambda^{n-K} e^{-n \lambda \bar y}. \tag{3}$$
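As a quick sanity check, the factored form $(1)$ and the simplified form $(3)$ can be compared numerically. This is just a sketch; the sample values, censoring point, and rate below are arbitrary illustrative choices, not taken from the discussion:

```python
import math

# Arbitrary illustrative sample, censoring point, and rate (not from the text)
x = [0.3, 2.5, 0.8, 1.7, 0.1, 0.6]
c = 1.0
lam = 0.5

y = [min(xi, c) for xi in x]           # censored observations
n = len(y)
K = sum(1 for yi in y if yi == c)      # number of censored points
ybar = sum(y) / n

# Form (1): product of survival probabilities and densities
L1 = math.exp(-K * lam * c)
for yi in y:
    if yi < c:
        L1 *= lam * math.exp(-lam * yi)

# Form (3): lambda^(n-K) * exp(-n * lambda * ybar)
L3 = lam ** (n - K) * math.exp(-n * lam * ybar)

print(L1, L3)  # the two forms agree
```

The agreement holds for any $\lambda$, which is exactly the algebraic identity from Equation $(2)$: $Kc + (n-K)\bar x' = n \bar y$.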
Thus the log-likelihood is $$\ell (\lambda \mid \boldsymbol y) = (n-K) \log \lambda - n \bar y \lambda$$ and solving for the critical point, we require
$$0 = \frac{\partial \ell}{\partial \lambda} = \frac{n-K}{\lambda} - n \bar y,$$ or $$\hat \lambda = \frac{n-K}{n \bar y}. \tag{4}$$
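One can double-check the closed form in Equation $(4)$ against a brute-force maximization of the log-likelihood over a grid. A sketch, using an arbitrary synthetic censored sample (not from the text):

```python
import math

# Arbitrary illustrative censored sample with c = 1 (not from the text)
y = [0.42, 1.0, 0.77, 1.0, 0.15, 1.0, 0.9, 0.33]
c = 1.0
n = len(y)
K = sum(1 for yi in y if yi == c)
ybar = sum(y) / n

def loglik(lam):
    # ell(lambda | y) = (n - K) log(lambda) - n * ybar * lambda
    return (n - K) * math.log(lam) - n * ybar * lam

# Closed form from Equation (4)
lam_hat = (n - K) / (n * ybar)

# Brute-force maximizer over a fine grid
grid = [0.001 * k for k in range(1, 5000)]
lam_grid = max(grid, key=loglik)

print(lam_hat, lam_grid)  # agree to grid resolution
```

The grid maximizer matches the closed form up to the grid spacing, as expected since $\ell$ is strictly concave in $\lambda$.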
The reason why we can do this is because the likelihood function is not a density with respect to $\lambda$. Rather, it expresses as a function of $\lambda$ how likely we were to have observed the sample $\boldsymbol y = (y_1, \ldots, y_n)$. That Equation $(1)$ mixes probabilities and densities together is immaterial because it still quantifies the relative likelihood of observing such a sample.
Let's consider an example. Suppose I generate a sample for $n = 10$ $$\begin{align} \boldsymbol x &= (0.852463, 3.28123, 1.0474, 2.33075, 0.975358, \\ & \quad 4.53211, 0.617489, 0.627495, 0.0987318, 0.801737). \end{align}$$ This sample happened to be generated from the choice $\lambda = 1/3$. Next, suppose we censor at $c = 1$: this gives us the censored observations $$\begin{align} \boldsymbol y &= (0.852463, 1, 1, 1, 0.975358, \\ & \quad 1, 0.617489, 0.627495, 0.0987318, 0.801737). \end{align}$$
The sample statistics we can calculate are $$K = 4, \quad \bar y = 0.797327.$$
Then our estimate from Equation $(4)$ is $$\hat \lambda = \frac{n-K}{n \bar y} = \frac{6}{7.97327} = 0.752514,$$ which is in the same ballpark as the MLE of the uncensored sample, $1/\bar x = 0.659424$. Obviously such a small sample could not be expected to perform well for estimating $\lambda$, but this is just an example.
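The sample statistics and the estimate from Equation $(4)$ can be reproduced in a few lines:

```python
# The example sample from above, censored at c = 1
x = [0.852463, 3.28123, 1.0474, 2.33075, 0.975358,
     4.53211, 0.617489, 0.627495, 0.0987318, 0.801737]
c = 1.0

y = [min(xi, c) for xi in x]
n = len(y)
K = sum(1 for yi in y if yi == c)      # K = 4
ybar = sum(y) / n                      # ybar = 0.797327

lam_hat = (n - K) / (n * ybar)         # censored MLE, Equation (4)
lam_uncensored = n / sum(x)            # uncensored MLE, 1 / xbar

print(K, ybar, lam_hat, lam_uncensored)
```

Running this gives $K = 4$, $\bar y \approx 0.797327$, $\hat\lambda \approx 0.7525$, and $1/\bar x \approx 0.6594$.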
Now, what is the value of $\mathcal L$ for the uncensored and censored samples? That's easy:
$$\mathcal L(\lambda \mid \boldsymbol x) = \prod_{i=1}^{10} \lambda e^{-\lambda x_i} = \lambda^{10} e^{-n \bar x \lambda} = \lambda^{10} e^{-15.1648 \lambda},$$ and from Equation $(3)$, $$\mathcal L(\lambda \mid \boldsymbol y) = \lambda^6 e^{-7.97327 \lambda}.$$
Of course they won't look the same. Neither one is a proper density over $\lambda$. They aren't probability distributions in the frequentist sense because $\lambda$ is a parameter: a fixed but unknown value. A Bayesian would think nothing of normalizing the likelihood to obtain a posterior density for $\lambda$, which is certainly possible (and useful for, say, characterizing the uncertainty of the point estimate).
Now, your question seems to be about why the probability portion of the likelihood, $\prod \Pr[X_i > c]$, and the density portion, $\prod f_{X_i}(x_i)$, can simply be multiplied together when their values are not on a comparable scale; e.g., the density at a single observation can exceed $1$. The point is that you aren't comparing these two parts of the likelihood against each other. Instead, these components are fixed by the observations, and what you are doing by computing an MLE is finding the value of $\lambda$ that maximizes $\mathcal L$ for the data you got.
For instance, if we perform a likelihood ratio test, the test statistic looks something like $$\Lambda = \frac{\mathcal L(\lambda_1 \mid \boldsymbol y)}{\mathcal L(\lambda_2 \mid \boldsymbol y)},$$ or some injective transformation thereof, and what you can see from this expression is that the comparison is between $\lambda_1$ and $\lambda_2$ within each of the aforementioned components of the likelihood; i.e. the ratio can be written as the product of the ratios of the censored observations and the uncensored observations. There's no point in normalizing anything because the likelihood is relative, not absolute. It says "this value of $\lambda$ is more (or less) likely than another value." It doesn't say "the probability of $\lambda$ is...."
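The factorization claim is easy to verify numerically: the overall ratio $\Lambda$ equals the ratio of the censored parts times the ratio of the uncensored parts. A sketch, with arbitrary sample values and candidate rates $\lambda_1, \lambda_2$ (not from the text):

```python
import math

# Arbitrary censored sample (c = 1) and two candidate rates (not from the text)
y = [0.2, 1.0, 0.5, 1.0, 0.8]
c, lam1, lam2 = 1.0, 0.4, 0.9

K = sum(1 for yi in y if yi == c)
uncensored = [yi for yi in y if yi < c]

def lik(lam):
    # e^{-K lam c} times the product of lam e^{-lam y_i} over uncensored points
    L = math.exp(-K * lam * c)
    for yi in uncensored:
        L *= lam * math.exp(-lam * yi)
    return L

# Overall likelihood ratio Lambda
ratio = lik(lam1) / lik(lam2)

# Censored-part ratio times uncensored-part ratio
cens_ratio = math.exp(-K * lam1 * c) / math.exp(-K * lam2 * c)
unc_ratio = 1.0
for yi in uncensored:
    unc_ratio *= (lam1 * math.exp(-lam1 * yi)) / (lam2 * math.exp(-lam2 * yi))

print(ratio, cens_ratio * unc_ratio)  # identical up to floating-point error
```

Any common normalizing constant would cancel in the ratio, which is one way to see why normalizing the likelihood buys you nothing here.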
There are more formal treatments of this concept, but I have chosen to keep it less formal so that the discussion is more accessible.