Multinomial Logistic Regression


(1) $$P(y^{(i)} =1\mid X,W) = \frac{\exp(W^{(i)^T}X)}{\sum_{j=1}^m \exp(W^{(j)^T}X)}$$

Here $W^{(i)}$ and $y$ are vectors, with the superscript acting as an index, and there are $m$ classes (that is, there are $m$ of the $y^{(i)}$'s).

The normalization condition:

$$\sum_{j=1}^m P(y^{(j)} =1\mid X,W) = 1.$$
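This condition can be checked numerically. Below is a small sketch (my own, not from any textbook) in which the rows of a matrix `W` play the role of the weight vectors $W^{(i)}$, with arbitrary choices for the number of classes and features:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 3                  # m classes, d features (arbitrary choices)
W = rng.normal(size=(m, d))  # row i is the weight vector W^(i)
X = rng.normal(size=d)

scores = W @ X                                 # W^(i)^T X for each class i
probs = np.exp(scores) / np.exp(scores).sum()  # equation (1), one entry per class
print(probs.sum())                             # sums to 1 by construction
```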

I'm having a hard time understanding this sentence: "Because of the normalization condition the weight vector for one of the classes need not be estimated. Without loss of generality we can set $w^{m} = 0$." I don't see the reasoning that starts with the normalization condition and ends with setting one of the weight vectors, say, $w^m = 0$.
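To make the claim I'm asking about concrete: numerically, subtracting $W^{(m)}$ from every weight vector (so the last one becomes $0$) seems to leave the probabilities in (1) unchanged. A quick check (again my own sketch, same setup as above):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 4, 3
W = rng.normal(size=(m, d))  # row i is the weight vector W^(i)
X = rng.normal(size=d)

def softmax(scores):
    e = np.exp(scores - scores.max())  # shift for numerical stability
    return e / e.sum()

p_original = softmax(W @ X)
W_shifted = W - W[-1]        # subtract W^(m) from every row; last row is now 0
p_shifted = softmax(W_shifted @ X)
print(np.allclose(p_original, p_shifted))  # the probabilities are unchanged
```

So empirically the model is unchanged by this shift, but I'd like to see the argument from the normalization condition.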

As a side question, how do you derive (1)?