(This might be an easily searchable question, but I might just not have the vocabulary for it.)
My question is:
Given a fixed probability distribution $p$ over $k$ classes (that is, a non-negative $k$-vector of reals which sum to 1), is the function $$ L(q) = \sum_{i=1}^k p_i(1-q_i)^2, $$ subject to the constraint that $q$ is also a probability distribution, minimized when $q = p$?
The context is as follows: suppose we are trying to predict the class (out of $k$ possibilities), and $p$ is the true distribution of the data. Further suppose that, when the true class is $i$, a prediction $q$ is punished with loss $(1-q_i)^2$. Then $L(q)$ is the expected loss when making prediction $q$.
I would like to prove (or possibly disprove) that the best prediction you can make is $p$ itself. I thought I had a fairly easy proof of this, but there was a dumb arithmetic error in it.
If I set up a Lagrangian, using only the constraint that $\sum_i q_i = 1$ but not the constraints $q_i \geq 0$, then stationarity gives $-2p_i(1-q_i) = \lambda$ for each $i$, i.e. for all $i, j$ the condition $$ p_i(q_i-1) = p_j(q_j- 1). $$ In the case $k=2$, substituting $q_1 - 1 = -q_2$ and $q_2 - 1 = -q_1$ reduces this to $p_1q_2 = p_2q_1$, and it follows that the unique solution (subject to the constraint) is $q = p$ -- but I can't quite seem to figure it out when $k > 2$.
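For what it's worth, the stationarity conditions can be solved in closed form when every $p_i > 0$ (still ignoring the constraints that $q_i \geq 0$): writing $p_i(q_i - 1) = c$ for a common constant $c$,
$$ q_i = 1 + \frac{c}{p_i}, \qquad \sum_{i=1}^k q_i = 1 \;\implies\; c = \frac{1-k}{\sum_{j=1}^k p_j^{-1}}. $$
As a sanity check, for $k = 2$ this gives $c = -\frac{1}{p_1^{-1} + p_2^{-1}} = -p_1p_2$, hence $q_1 = 1 - p_2 = p_1$ and likewise $q_2 = p_2$, recovering $q = p$.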
Turns out it's just false, and a quick computerized search for a counterexample showed that. If we take $p = (1/5, 2/5, 2/5)$ and $q = (0, 2/5, 3/5)$ then $L(p) = 52/125$ while $L(q) = 51/125$. Thus the claim I tried to prove holds only when $k = 2$.
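The counterexample is easy to verify with exact rational arithmetic; here is a minimal sketch (the function name `L` mirrors the loss above and is my own choice):

```python
from fractions import Fraction

def L(p, q):
    # Expected loss: sum_i p_i * (1 - q_i)^2
    return sum(pi * (1 - qi) ** 2 for pi, qi in zip(p, q))

p = [Fraction(1, 5), Fraction(2, 5), Fraction(2, 5)]
q = [Fraction(0), Fraction(2, 5), Fraction(3, 5)]

print(L(p, p))  # 52/125
print(L(p, q))  # 51/125 -- strictly smaller, so predicting p is not optimal
```

Using `Fraction` instead of floats rules out the possibility that the gap $52/125 - 51/125$ is a rounding artifact.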