I'd like to understand the major reason we use a loss function of the following form in machine learning (I know it is obtained by taking the logarithm of the likelihood of the data, but that doesn't really explain why the other choices are bad):
$$L^{(1)}(\theta)=\frac{1}{N}\sum_i \log p(y_i|x_i; \theta)$$
Why can't we, for example, simply use the following? $$L^{(2)}(\theta)=\frac{1}{N}\sum_i p(y_i|x_i; \theta)$$
One of the reasons I am asking is that if I try to optimize $L^{(2)}$, or anything other than a loss function that takes the logarithm of the probability, the algorithm fails to learn anything (e.g. for binary classification the accuracy stays around $0.5$). I sense it might have to do with the different trade-offs between examples that these two functions imply, but I can't quite wrap my head around it.
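One concrete way to see the difference is to compare gradients. As a sketch (assuming a sigmoid model and a single example with true label $y=1$, so $p=\sigma(z)$ for logit $z$): the gradient of $\log p$ with respect to $z$ is $1-p$, which stays large when the model is confidently wrong, while the gradient of $p$ itself is $p(1-p)$, which vanishes in exactly that regime:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# True label y = 1; the logit z is very negative, i.e. the model is
# confidently wrong about this example.
z = -8.0
p = sigmoid(z)            # probability assigned to the correct class (tiny)

grad_log = 1.0 - p        # d/dz of log p: stays close to 1 when p is small
grad_raw = p * (1.0 - p)  # d/dz of p: collapses toward 0 when p is small

print(grad_log)  # close to 1 -> strong learning signal
print(grad_raw)  # close to 0 -> almost no learning signal
```

This is consistent with the symptom described above: under $L^{(2)}$ the examples the model gets badly wrong contribute almost no gradient, so the parameters barely move.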
Because the sum of logarithms is the logarithm of a product, and you assume the observations are independent, so the probability of seeing the particular set of observations is the product (not the sum) of the individual probabilities. Maximizing $L^{(1)}$ is therefore exactly maximizing the joint likelihood of the data, whereas $L^{(2)}$, the average of the individual probabilities, has no such interpretation.
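There is also a purely numerical reason to work with the sum of logs rather than the product: a product of many probabilities underflows floating point almost immediately, while the corresponding log-likelihood is an ordinary finite number. A small sketch (with hypothetical per-example likelihoods):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-example likelihoods for 2000 independent observations.
probs = rng.uniform(0.1, 0.9, size=2000)

product = np.prod(probs)         # joint likelihood: underflows to exactly 0.0
log_sum = np.sum(np.log(probs))  # joint log-likelihood: a normal finite float

print(product)   # 0.0 -- below the smallest representable float64
print(log_sum)   # a large negative but perfectly usable number
```

So even when the two objectives agree in exact arithmetic at their maximizer, only the log form is workable in practice.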