Question on derivation of probability proportion in Deep Learning Book


I am working through the Deep Learning Book and am currently on the regularization chapter (https://www.deeplearningbook.org/contents/regularization.html).

My question concerns the third step in the following derivation:

$$\begin{align*} \tilde{P}_{\text{ensemble}}(y=y\vert v)&\propto \sqrt[2^n]{\prod_{d\in \{0,1\}^n} \exp\left(W_{y,:}^T(d\odot v) + b_y\right)}\\ &=\exp\left(\frac{1}{2^n} \sum_{d \in \{0,1\}^n} \left(W_{y,:}^T(d\odot v) + b_y\right)\right)\\ &=\exp\left(\frac{1}{2} W_{y,:}^Tv + b_y\right). \end{align*}$$

I'd like to clarify that $\odot$ represents the operation of component-wise multiplication and $d$ and $v$ are vectors while $W$ is an $m \times n$ matrix.

I am confused because it seems that in the second-to-last step we are summing over all $2^n$ binary strings of length $n$. So to me the simplification should be

$$\begin{align*} \tilde{P}_{\text{ensemble}}(y=y\vert v)&\propto \sqrt[2^n]{\prod_{d\in \{0,1\}^n} \exp\left(W_{y,:}^T(d\odot v) + b_y\right)}\\ &=\exp\left(\frac{1}{2^n} \sum_{d \in \{0,1\}^n} \left(W_{y,:}^T(d\odot v) + b_y\right)\right)\\ &=\exp\left(\frac{1}{2^n}\, 2^n\left(W_{y,:}^Tv + b_y\right)\right)\\ &=\exp\left(W_{y,:}^Tv + b_y\right) \end{align*}$$

Where does the extra factor $\frac{1}{2}$ come from?
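For what it's worth, a quick numeric check (with hypothetical small values for $W_{y,:}$, $v$ and $b_y$) agrees with the book's $\frac12$ and not with my simplification:

```python
import itertools

import numpy as np

# Numeric check for one row W_y of W and small n: the 2^n-th root of the
# product of exp(W_y . (d ⊙ v) + b_y) over all binary masks d should equal
# exp(0.5 * W_y . v + b_y), the book's claim.
rng = np.random.default_rng(0)
n = 4
W_y = rng.normal(size=n)  # hypothetical row of W
v = rng.normal(size=n)    # hypothetical input
b_y = 0.7                 # hypothetical bias

log_terms = [W_y @ (np.array(d) * v) + b_y
             for d in itertools.product([0, 1], repeat=n)]
geo_mean = np.exp(np.mean(log_terms))  # 2^n-th root of the product

# Matches the book's result with the factor 1/2 ...
assert np.isclose(geo_mean, np.exp(0.5 * W_y @ v + b_y))
# ... and not the version without it.
assert not np.isclose(geo_mean, np.exp(W_y @ v + b_y))
```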

Best Answer

By linearity,

$$ \sum_{d\in\{0,1\}^n}W_{y,:}^T(d\odot v)=W_{y,:}^T\left(\left(\sum_{d\in\{0,1\}^n}d\right)\odot v\right)\;. $$

So indeed you're summing over all $2^n$ binary vectors $d$. In each component, half of them have a $0$ and half of them have a $1$, so $\sum_{d\in\{0,1\}^n} d = 2^{n-1}\mathbf{1}$ and every component of $v$ is multiplied by $\frac12\cdot2^n$, not $2^n$. Dividing by $2^n$ then leaves the factor $\frac12$ on $W_{y,:}^Tv$, while the $2^n$ copies of $b_y$ average back to a single $b_y$.
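The counting argument can be verified directly for small $n$ (arbitrary sample values for $v$):

```python
import itertools

import numpy as np

# Summing d ⊙ v over all 2^n binary vectors d multiplies each component of v
# by 2^(n-1), because each coordinate equals 1 in exactly half of the vectors.
n = 4
v = np.array([1.0, 2.0, 3.0, 4.0])  # arbitrary sample vector

total = sum(np.array(d) * v for d in itertools.product([0, 1], repeat=n))

# Each entry of the sum is (1/2) * 2^n times the corresponding entry of v.
assert np.allclose(total, 2 ** (n - 1) * v)
```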