I'm reading Neural Networks for Pattern Recognition by Christopher M. Bishop. It's for a physics class, but I think the problem is closer to mathematics, so I'm asking here instead of PSE. Chapter 4 of the book deals with multi-layer perceptrons, and my question is specifically about two-layer perceptrons with Heaviside activation functions for all units. The convention for the Heaviside function used in the book is $$ g(a) = \left\{ \begin{array}{lr} 0 & a<0 \\ 1 & a \geq 0 \end{array} \right. $$ where I have denoted the step function by $g$, following the book, which uses this letter for all activation functions.
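In Python terms, this is the convention all of my checks further down implement via `>= 0` (the helper name is mine):

```python
def g(a):
    """The book's Heaviside convention: g(0) = 1."""
    return 1 if a >= 0 else 0
```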
When the inputs are binary, such a perceptron computes a Boolean function, since the output is also binary by the nature of the activation function. Now the claim is the following: this sort of network can generate any Boolean function, provided the number $M$ of hidden units is sufficiently large.
A justification for this claim is given in the form of an explicit construction of such a network for an arbitrary Boolean function with $d$ inputs. The number of possible binary patterns in this case is $2^d$, so the Boolean function is completely determined once the outputs for all $2^d$ of these patterns are specified. To construct the corresponding network, add a hidden unit for every pattern whose target output value is $1$.
Then choose the weights in such a way that each hidden unit responds only to its corresponding pattern. This can be done by setting the weight from an input to a given hidden unit equal to $+1$ if the corresponding pattern has a $1$ for that input and equal to $-1$ otherwise. Also set the bias for each hidden unit equal to $1-b$, where $b$ is the number of non-zero inputs for the corresponding pattern.
This way of setting up the weights ensures that the summed input for that hidden unit equals $b$ when the corresponding pattern $P$ is presented to the network, which, together with the bias, yields $a=1$ for that hidden unit and therefore $g(a) = 1$. Bishop continues by stating that **the summed input will be at most $b-2$ for every other pattern** (or: for every other unit when presented with the pattern $P$). In that case adding the bias and applying the activation function yields $g(b-2+1-b) = g(-1) = 0$. The last step is to connect each hidden unit to the output with a weight $+1$ and to set the output bias equal to $-1$.
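To make sure I'm reading this recipe correctly, here is my understanding of it as a short NumPy sketch (the function names and the encoding are mine, not the book's):

```python
import numpy as np

def build_network(patterns):
    """Bishop's construction: one hidden unit per pattern with target output 1.

    patterns: list of binary tuples (the inputs that should map to 1).
    Returns hidden weights, hidden biases, output weights, output bias.
    """
    P = np.array(patterns)        # shape (M, d), one row per stored pattern
    W = np.where(P == 1, 1, -1)   # +1 where the pattern has a 1, -1 otherwise
    b = P.sum(axis=1)             # number of non-zero inputs per pattern
    return W, 1 - b, np.ones(len(patterns), dtype=int), -1

def forward(x, W, hidden_bias, v, output_bias):
    """Evaluate the network on a binary input x, using g(0) = 1 for all units."""
    h = (W @ np.asarray(x) + hidden_bias >= 0).astype(int)  # hidden layer
    return int(v @ h + output_bias >= 0)                    # output unit
```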
Now, in order to better understand this and to be able to explain it to others with a simple example, I tried to apply this construction to the 2-bit XOR problem. In this case, two patterns have a target output of $1$: $01$ and $10$. Therefore, we need two hidden units. The first has weight $-1$ for the first input and weight $+1$ for the second input; conversely, the weights for the second hidden unit are $+1$ and $-1$, respectively. The number of non-zero inputs is $1$ for each pattern, so the bias is $1-1 = 0$ for both hidden units. Together with the hidden-to-output weights of $+1$ and the output bias of $-1$, that fully specifies the network.
Unfortunately, this doesn't seem to do what it should. In fact, if the above Heaviside function is used for all units, both the output and the hidden units, this network yields $1$ for all inputs. If the step function is used only for the hidden units and the activation function for the output is taken to be simply $g(a) = a$, we find the exact opposite of XOR: $1$ for $00$ and $11$, and $0$ for $01$ and $10$.
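To make this reproducible, here is the XOR network written out explicitly, checked over all four inputs with both output conventions side by side (a NumPy sketch; variable names are mine):

```python
import numpy as np

W = np.array([[-1,  1],    # hidden unit for pattern 01
              [ 1, -1]])   # hidden unit for pattern 10
hidden_bias = np.array([0, 0])   # 1 - b with b = 1 for both patterns
v = np.array([1, 1])             # hidden-to-output weights
output_bias = -1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = (W @ np.array(x) + hidden_bias >= 0).astype(int)  # g(0) = 1
    a_out = v @ h + output_bias
    print(x, int(a_out >= 0), a_out)  # Heaviside output vs. identity output
```

This prints $1$ in the Heaviside column for every input, and $1, 0, 0, 1$ (i.e. XNOR) in the identity column, exactly as described above.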
In trying to find out why this happens, I've become suspicious of the claim emphasized in bold above: that the summed input for a hidden unit corresponding to a pattern $P$ with $b$ non-zero inputs is at most $b-2$ if the pattern presented is not $P$. I would argue that changing a single input of $P$, either from $1$ to $0$ (which removes a $+1$ contribution) or from $0$ to $1$ (which adds a $-1$ contribution), lowers the summed input by exactly $1$, so any pattern differing from $P$ in one place gives a summed input of $b-1$, not $b-2$. In that case the weighted inputs and the bias sum to $a = b-1+1-b = 0$, which yields a $1$ via the step function. And this is exactly what happens in my XOR network: both $00$ and $11$ differ from each stored pattern in a single input, so both hidden units fire and the output is $1$.
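To convince myself, I checked this numerically: with the weights the book prescribes, the summed input turns out to be exactly $b$ minus the Hamming distance between $P$ and the presented pattern (again a sketch; the pattern is an arbitrary example of mine):

```python
import numpy as np
from itertools import product

P = np.array([1, 1, 0, 1])       # an arbitrary pattern with b = 3
w = np.where(P == 1, 1, -1)      # the book's weights for P's hidden unit
b = int(P.sum())

for x in product([0, 1], repeat=len(P)):
    x = np.array(x)
    summed = int(w @ x)              # summed input to the hidden unit
    dist = int(np.sum(P != x))       # Hamming distance from P
    assert summed == b - dist        # a single flipped bit costs 1, not 2
```

So patterns at Hamming distance $1$ give a summed input of $b-1$, and the book's bound of $b-2$ only holds at distance $2$ or more.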
The right output is generated if I use the above convention for the step function at the output but the convention where $a=0$ yields $0$ for the hidden units. However, this is not mentioned in the book, and it seems a bit ridiculous, so there must be something else I am missing. I have also noticed that the entire thing works if, with everything else kept as set up in the book, I take the hidden-unit bias to be $-b$ instead of $1-b$, at least for the XOR (and XNOR) problem with the book's convention for the Heaviside function. But I would feel more sure of all this if someone were to corroborate it, or indeed correct it and explain what went wrong. So can anyone enlighten me?
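For reference, here is the quick check that convinced me the $-b$ bias works for XOR with $g(0)=1$ everywhere (same NumPy encoding as in my sketches above):

```python
import numpy as np

W = np.array([[-1, 1], [1, -1]])
hidden_bias = np.array([-1, -1])   # -b with b = 1, instead of 1 - b = 0
v = np.array([1, 1])
output_bias = -1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = (W @ np.array(x) + hidden_bias >= 0).astype(int)  # g(0) = 1
    print(x, int(v @ h + output_bias >= 0))  # prints 0, 1, 1, 0: XOR
```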