I'm reading Neural Networks for Pattern Recognition by Christopher M. Bishop. It's for a physics class, but I think the problem is closer to mathematics, so I'm asking here instead of PSE. Chapter 4 of the book deals with multi-layer perceptrons, and my question is specifically about two-layer perceptrons with Heaviside activation functions for all units. The convention for the Heaviside function used in the book is $$ g(a) = \left\{ \begin{array}{lr} 0 & a<0 \\ 1 & a \geq 0 \end{array} \right. $$ where I have denoted the step function by $g$, following the book, which uses this letter for all activation functions.
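In Python terms, this is the convention all of my checks further down implement via `>= 0` (the helper name is mine):

```python
def g(a):
    """The book's Heaviside convention: g(0) = 1."""
    return 1 if a >= 0 else 0
```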
When the inputs are binary, such a perceptron computes a Boolean function, since the output is also binary by the nature of the activation function. Now the claim is the following: this sort of network can generate any Boolean function, provided the number $M$ of hidden units is sufficiently large.
A justification for this claim is given in the form of an explicit construction of such a network for an arbitrary Boolean function with $d$ inputs. The number of possible binary patterns in this case is $2^d$, so the Boolean function is completely determined once the outputs for all $2^d$ of these patterns are specified. To construct the corresponding network, add a hidden unit for every pattern whose target output value is $1$.
Then choose the weights in such a way that each hidden unit responds only to its corresponding pattern. This can be done by setting the weight from an input to a given hidden unit equal to $+1$ if the corresponding pattern has a $1$ for that input and equal to $-1$ otherwise. Also set the bias for each hidden unit equal to $1-b$, where $b$ is the number of non-zero inputs for the corresponding pattern.
This way of setting up the weights ensures that the summed input for that hidden unit equals $b$ when the corresponding pattern $P$ is presented to the network, which, together with the bias, yields $a=1$ for that hidden unit and therefore $g(a) = 1$. Bishop continues by stating that **the summed input will be at most $b-2$ for every other pattern** (or: for every other unit when presented with the pattern $P$). In that case adding the bias and applying the activation function yields $g(b-2+1-b) = g(-1) = 0$. The last step is to connect each hidden unit to the output with a weight $+1$ and to set the output bias equal to $-1$.
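To make sure I'm reading this recipe correctly, here is my understanding of it as a short NumPy sketch (the function names and the encoding are mine, not the book's):

```python
import numpy as np

def build_network(patterns):
    """Bishop's construction: one hidden unit per pattern with target output 1.

    patterns: list of binary tuples (the inputs that should map to 1).
    Returns hidden weights, hidden biases, output weights, output bias.
    """
    P = np.array(patterns)        # shape (M, d), one row per stored pattern
    W = np.where(P == 1, 1, -1)   # +1 where the pattern has a 1, -1 otherwise
    b = P.sum(axis=1)             # number of non-zero inputs per pattern
    return W, 1 - b, np.ones(len(patterns), dtype=int), -1

def forward(x, W, hidden_bias, v, output_bias):
    """Evaluate the network on a binary input x, using g(0) = 1 for all units."""
    h = (W @ np.asarray(x) + hidden_bias >= 0).astype(int)  # hidden layer
    return int(v @ h + output_bias >= 0)                    # output unit
```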
Now, in order to better understand this and to be able to explain it to others with a simple example, I tried to apply this construction to the 2-bit XOR problem. In this case, two patterns have a target output of $1$: $01$ and $10$. Therefore, we need two hidden units. The first has weight $-1$ for the first input and weight $+1$ for the second input; conversely, the weights for the second hidden unit are $+1$ and $-1$, respectively. The number of non-zero inputs is $1$ for each pattern, so the bias is $1-1 = 0$ for both hidden units. Together with the hidden-to-output weights of $+1$ and the output bias of $-1$, that fully specifies the network.
Unfortunately, this doesn't seem to do what it should. In fact, if the above Heaviside function is used for all units, both the output and the hidden units, this network yields $1$ for all inputs. If the step function is used only for the hidden units and the activation function for the output is taken to be simply $g(a) = a$, we find the exact opposite of XOR: $1$ for $00$ and $11$, and $0$ for $01$ and $10$.
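To make this reproducible, here is the XOR network written out explicitly, checked over all four inputs with both output conventions side by side (a NumPy sketch; variable names are mine):

```python
import numpy as np

W = np.array([[-1,  1],    # hidden unit for pattern 01
              [ 1, -1]])   # hidden unit for pattern 10
hidden_bias = np.array([0, 0])   # 1 - b with b = 1 for both patterns
v = np.array([1, 1])             # hidden-to-output weights
output_bias = -1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = (W @ np.array(x) + hidden_bias >= 0).astype(int)  # g(0) = 1
    a_out = v @ h + output_bias
    print(x, int(a_out >= 0), a_out)  # Heaviside output vs. identity output
```

This prints $1$ in the Heaviside column for every input, and $1, 0, 0, 1$ (i.e. XNOR) in the identity column, exactly as described above.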
In trying to find out why this happens, I've become suspicious of the claim emphasized in bold above: that the summed input for a hidden unit corresponding to a pattern $P$ with $b$ non-zero inputs is at most $b-2$ if the pattern presented is not $P$. I would argue that changing a single input of $P$, either from $1$ to $0$ (which removes a $+1$ contribution) or from $0$ to $1$ (which adds a $-1$ contribution), lowers the summed input by exactly $1$, so any pattern differing from $P$ in one place gives a summed input of $b-1$, not $b-2$. In that case the weighted inputs and the bias sum to $a = b-1+1-b = 0$, which yields a $1$ via the step function. And this is exactly what happens in my XOR network: both $00$ and $11$ differ from each stored pattern in a single input, so both hidden units fire and the output is $1$.
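To convince myself, I checked this numerically: with the weights the book prescribes, the summed input turns out to be exactly $b$ minus the Hamming distance between $P$ and the presented pattern (again a sketch; the pattern is an arbitrary example of mine):

```python
import numpy as np
from itertools import product

P = np.array([1, 1, 0, 1])       # an arbitrary pattern with b = 3
w = np.where(P == 1, 1, -1)      # the book's weights for P's hidden unit
b = int(P.sum())

for x in product([0, 1], repeat=len(P)):
    x = np.array(x)
    summed = int(w @ x)              # summed input to the hidden unit
    dist = int(np.sum(P != x))       # Hamming distance from P
    assert summed == b - dist        # a single flipped bit costs 1, not 2
```

So patterns at Hamming distance $1$ give a summed input of $b-1$, and the book's bound of $b-2$ only holds at distance $2$ or more.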
The right output is generated if I use the above convention for the step function at the output but the convention where $a=0$ yields $0$ for the hidden units. However, this is not mentioned in the book, and it seems a bit ridiculous, so there must be something else I am missing. I have also noticed that the entire thing works if, with everything else kept as set up in the book, I take the hidden-unit bias to be $-b$ instead of $1-b$, at least for the XOR (and XNOR) problem with the book's convention for the Heaviside function. But I would feel more sure of all this if someone were to corroborate it, or indeed correct it and explain what went wrong. So can anyone enlighten me?
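For reference, here is the quick check that convinced me the $-b$ bias works for XOR with $g(0)=1$ everywhere (same NumPy encoding as in my sketches above):

```python
import numpy as np

W = np.array([[-1, 1], [1, -1]])
hidden_bias = np.array([-1, -1])   # -b with b = 1, instead of 1 - b = 0
v = np.array([1, 1])
output_bias = -1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = (W @ np.array(x) + hidden_bias >= 0).astype(int)  # g(0) = 1
    print(x, int(v @ h + output_bias >= 0))  # prints 0, 1, 1, 0: XOR
```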