I'd like to understand the major reason we use a loss function of the following form in machine learning (I know it is obtained by taking the logarithm of the likelihood of the data, but that doesn't really explain why the other choices are bad):
$$L^{(1)}(\theta)=\frac{1}{N}\sum_i \log p(y_i|x_i; \theta)$$
Why can't we, for example, simply use the following? $$L^{(2)}(\theta)=\frac{1}{N}\sum_i p(y_i|x_i; \theta)$$
One of the reasons I am asking is that if I try to optimize $L^{(2)}$, or anything other than a loss function that takes the logarithm of the probability, the algorithm fails to learn anything (e.g. for binary classification the accuracy stays around $0.5$). I sense it might have to do with the different trade-offs between examples that these two functions imply, but I can't quite wrap my head around it.
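One concrete way to see the difference is to compare gradients. As a sketch (assuming a sigmoid model and a single example with true label $y=1$, so $p=\sigma(z)$ for logit $z$): the gradient of $\log p$ with respect to $z$ is $1-p$, which stays large when the model is confidently wrong, while the gradient of $p$ itself is $p(1-p)$, which vanishes in exactly that regime:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# True label y = 1; the logit z is very negative, i.e. the model is
# confidently wrong about this example.
z = -8.0
p = sigmoid(z)            # probability assigned to the correct class (tiny)

grad_log = 1.0 - p        # d/dz of log p: stays close to 1 when p is small
grad_raw = p * (1.0 - p)  # d/dz of p: collapses toward 0 when p is small

print(grad_log)  # close to 1 -> strong learning signal
print(grad_raw)  # close to 0 -> almost no learning signal
```

This is consistent with the symptom described above: under $L^{(2)}$ the examples the model gets badly wrong contribute almost no gradient, so the parameters barely move.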
Because the sum of logarithms is the logarithm of a product, and you assume the observations are independent, so the probability of seeing the particular set of observations is the product (not the sum) of the individual probabilities. Maximizing $L^{(1)}$ is therefore exactly maximizing the joint likelihood of the data, whereas $L^{(2)}$, the average of the individual probabilities, has no such interpretation.
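There is also a purely numerical reason to work with the sum of logs rather than the product: a product of many probabilities underflows floating point almost immediately, while the corresponding log-likelihood is an ordinary finite number. A small sketch (with hypothetical per-example likelihoods):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-example likelihoods for 2000 independent observations.
probs = rng.uniform(0.1, 0.9, size=2000)

product = np.prod(probs)         # joint likelihood: underflows to exactly 0.0
log_sum = np.sum(np.log(probs))  # joint log-likelihood: a normal finite float

print(product)   # 0.0 -- below the smallest representable float64
print(log_sum)   # a large negative but perfectly usable number
```

So even when the two objectives agree in exact arithmetic at their maximizer, only the log form is workable in practice.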