Let $(x_i,y_i)$, $i=1,...,N$, be an i.i.d. dataset. Suppose we have a parametric distribution $f(y|x,\omega)$, parameterized by $\omega$, and we want to find the value $\omega^*$ such that $f(y|x,\omega^*)$ is the best approximation of the true distribution $p(y|x)$.
A usual criterion is the maximization of the likelihood of the $y_i$'s conditioned on the $x_i$'s, i.e.:
$\newcommand{\argmax}{\mathop{\mathrm{arg\,max}}}$ $\omega^* = \argmax_{\omega} f(y_1,...,y_N|x_1,...,x_N,\omega) =\argmax_{\omega} \prod_{i=1}^{N} f(y_i|x_i,\omega)$.
(the equality above holds due to the assumption that the examples in the dataset are independent)
For practical reasons, it is usual to work instead with the following (equivalent) formulation:
$\newcommand{\argmin}{\mathop{\mathrm{arg\,min}}}$ $\omega^* = \argmin_{\omega} \sum_{i=1}^{N} -\log(f(y_i|x_i,\omega))$
(these formulations are equivalent because the $\log(\cdot)$ function is monotonically increasing, and negating the objective turns the maximization into a minimization)
For convenience, let me define $L(\omega)=\sum_{i=1}^{N} -\log(f(y_i|x_i,\omega))$ and call it a loss function.
If the $y_i$'s are discrete random variables, then $f(y|x,\omega) \leq 1$ and therefore $-\log(f(y_i|x_i,\omega)) \geq 0$ for each $i$, so $L(\omega) \geq 0$ for all $\omega$: the loss function is non-negative and hence lower bounded.
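To illustrate the discrete case, here is a minimal sketch using a hypothetical Bernoulli model (the helper `bernoulli_nll` and the probabilities are made up for illustration): since each $f(y_i|x_i,\omega)$ is a probability in $(0,1]$, every $-\log$ term is non-negative, and so is their sum.

```python
import math

def bernoulli_nll(ys, ps):
    # ps[i] plays the role of f(y_i = 1 | x_i, w); each term
    # -log f(y_i | x_i, w) is >= 0 because f is a probability <= 1.
    return sum(-math.log(p if y == 1 else 1 - p)
               for y, p in zip(ys, ps))

loss = bernoulli_nll([1, 0, 1], [0.9, 0.2, 0.6])
assert loss >= 0  # L(w) >= 0 for any discrete model
print(loss)
```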
If, instead, the $y_i$'s are continuous random variables, then $f(y|x,\omega)$ is a pdf and so it is not necessarily less than $1$ (although it integrates to $1$ w.r.t. $y$). Under these circumstances, is it possible to ensure that $L(\omega)$ is lower bounded (i.e. that the optimization problem actually has a solution regardless of the model family)? Is there any obvious lower bound for it?
Thank you in advance.
There is no lower bound in general. Here is a counter-example.
Consider a single point $x_0$ and model it using a Gaussian distribution with mean $x_0$ and unknown variance $\sigma^2$.
The negative log-likelihood reads \begin{equation} \ell(\sigma,x_0) = \log\left(\sqrt{2\pi}\sigma\right), \end{equation} which tends to $-\infty$ as $\sigma \to 0^+$, so it has no lower bound.
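A quick numerical check of this counter-example (the function name `nll` is mine): evaluating the negative log-likelihood at ever smaller $\sigma$ shows it decreasing without bound.

```python
import math

def nll(sigma):
    # NLL of a single observation x0 under N(x0, sigma^2): the
    # squared-error term vanishes, leaving log(sqrt(2*pi) * sigma).
    return math.log(math.sqrt(2 * math.pi) * sigma)

# As sigma -> 0+, the NLL decreases without bound.
values = [nll(10.0 ** (-k)) for k in range(6)]
print(values)
```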
This is the reason why it is important to choose appropriate models and to regularize your model using priors. For example, using a Fréchet prior with $m=0$, $s=1$ and $\alpha=2$ on the standard deviation, the negative log-posterior reads \begin{equation} \ell_p(\sigma,x_0) = \log\left(\sqrt{2\pi}\sigma\right) + \log\left(\frac{\sigma^3}{2}\right) + \frac{1}{\sigma^2}. \end{equation} One can show that $\ell_p$ has a lower bound on $\sigma > 0$.
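A sketch checking that this regularized objective is bounded below (the function name `nll_post` is mine): up to a constant, $\ell_p(\sigma) = 4\log\sigma + \sigma^{-2} + C$, whose derivative $4/\sigma - 2/\sigma^3$ vanishes at $\sigma^* = 1/\sqrt{2}$, and sampled values never dip below $\ell_p(\sigma^*)$.

```python
import math

def nll_post(sigma):
    # Gaussian NLL at its mean plus the negative log of a
    # Frechet(m=0, s=1, alpha=2) prior on sigma.
    return (math.log(math.sqrt(2 * math.pi) * sigma)
            + math.log(sigma ** 3 / 2)
            + 1.0 / sigma ** 2)

# Stationary point of 4*log(sigma) + sigma**-2 is sigma* = 1/sqrt(2).
sigma_star = 1 / math.sqrt(2)
samples = [nll_post(10.0 ** (x / 10)) for x in range(-30, 31)]
assert min(samples) >= nll_post(sigma_star) - 1e-12
print(nll_post(sigma_star))
```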