Why can't I use sum of probabilities as my loss function for machine learning?


I'd like to understand the main reason we use a loss function of the following form in machine learning (I know it is obtained by taking the logarithm of the likelihood of the data, but that doesn't really explain why the other choices are bad):

$$L^{(1)}(\theta)=\frac{1}{N}\sum_i \log p(y_i|x_i; \theta)$$

Why can't we, for example, simply use the following? $$L^{(2)}(\theta)=\frac{1}{N}\sum_i p(y_i|x_i; \theta)$$

One of the reasons I am asking is that if I try to optimize $L^{(2)}$, or indeed anything other than a loss function that takes the logarithm of the probability, the algorithm fails to learn anything (e.g. for binary classification the accuracy stays around $0.5$). I sense that it might have to do with the different trade-offs between examples that the two functions impose, but I can't quite wrap my head around it.
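For concreteness, here is a minimal sketch of the kind of comparison I have in mind (NumPy, synthetic data; the model, data, and hyperparameters are all made up for illustration): logistic regression trained by gradient *ascent* on each of the two objectives.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
X = rng.normal(size=(N, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)  # linearly separable labels

def train(use_log, steps=2000, lr=0.5):
    """Gradient ascent on mean log p (use_log=True) or mean p (use_log=False)."""
    w = np.zeros(2)
    for _ in range(steps):
        s = 1 / (1 + np.exp(-X @ w))  # model probabilities of class 1
        if use_log:
            # d/dw of (1/N) sum_i log p(y_i | x_i)
            grad = X.T @ (y - s) / N
        else:
            # d/dw of (1/N) sum_i p(y_i | x_i); note the extra s*(1-s) factor
            grad = X.T @ ((2 * y - 1) * s * (1 - s)) / N
        w += lr * grad  # ascend the objective
    return np.mean((s > 0.5) == y)  # training accuracy

print(train(use_log=True))   # accuracy with the log-likelihood objective
print(train(use_log=False))  # accuracy with the sum-of-probabilities objective
```

(On this easy separable toy problem both objectives may end up learning; the failure I describe shows up for me on harder problems, which is exactly what I'd like to understand.)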

1 Answer


Because the sum of logarithms is the logarithm of a product. You assume the observations are independent, so the probability of seeing this particular set of observations is the product (not the sum) of the individual probabilities. $L^{(1)}$ is the log of that joint probability, and maximizing a monotone transform such as the logarithm is equivalent to maximizing the joint probability itself; $L^{(2)}$, by contrast, is not the likelihood of the data under the model, so maximizing it does not maximize the probability of what you observed.
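A quick numeric illustration of that last point (a sketch assuming NumPy; the per-example likelihoods are made up): the joint likelihood is a product of many numbers below 1, so it underflows in floating point, while the equivalent sum of logs stays perfectly representable.

```python
import numpy as np

# Hypothetical per-example likelihoods p(y_i | x_i) for 5000 independent
# observations, each somewhere between 0.4 and 0.9.
rng = np.random.default_rng(0)
probs = rng.uniform(0.4, 0.9, size=5000)

joint = np.prod(probs)             # product of probabilities: the joint likelihood
log_joint = np.sum(np.log(probs))  # the same quantity, computed in log space

print(joint)      # 0.0 -- underflows past the smallest positive double
print(log_joint)  # a finite negative number
```

This is why, in practice, one works with the sum of log-probabilities rather than the product of probabilities, even though the two define the same maximizer.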