I have coded the following, which creates a network of neurons with a random number of layers and a random number of neurons in each layer. In addition, it sets random connection weights, adds a bias node with a random weight on each of its connections, and computes each neuron's output by passing its weighted sum of inputs through a sigmoid activation function. Neurons are gray-scaled to represent their output (brighter meaning higher output, darker meaning lower). The transparency of each connection represents the magnitude of its weight, with green connections being positive and red connections negative. As far as I can tell, everything is coded correctly up to this point.
Now, in order to test all this, I would like to be able to define an arbitrary function that, for each possible combination of inputs, has a pre-assigned set of outputs that I could train the network to learn. My first attempt at this failed, and I wasn't sure what I did wrong. Here is what I know. The cost function $e$ is:
$$e = \frac{1}{2} \sum (c - a)^2$$
where $c$ is the correct output for this output node and $a$ is the actual output of this output node.
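For concreteness, here is how I compute that cost over the output nodes (a minimal sketch; the example values are made up):

```python
def cost(correct, actual):
    """Half the sum of squared errors over the output nodes:
    e = 1/2 * sum((c - a)^2)."""
    return 0.5 * sum((c - a) ** 2 for c, a in zip(correct, actual))

# e.g. correct outputs [1.0, 0.0], actual outputs [0.8, 0.3]:
# e = 0.5 * (0.2^2 + (-0.3)^2) = 0.5 * 0.13 = 0.065
print(cost([1.0, 0.0], [0.8, 0.3]))  # ≈ 0.065
```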
My attempt was to take each connection sharing the same base neuron and see whether increasing its weight by a rate constant or decreasing it by the same rate constant led to a greater decrease in cost, then switch the weight to whichever change decreased the cost more. I did this for each connection that shares a common neuron at its base. I wasn't sure how to treat the bias node's connections; I assume they should be handled differently, since their result doesn't take a neuron's output into consideration.
I haven't gotten into partial derivatives in multivariable calculus yet, but I think I understand the concept: a function of cost that takes each connection's weight as an input can be split into separate functions of an individual weight's effect on cost while keeping the other weights constant. Wherever each 2D function's cost is at a global minimum, that change in weight can be applied to that weight, which would be "moving across the gradient" when applied to each connection that shares a common base neuron. Then the same can be applied to each connection in the net, and if more tuning is needed, the process can be repeated iteratively until the net is properly calibrated. Is this correct?
Partial derivatives are absolutely essential to gradient descent. I would recommend taking a class or reading a book on them before you proceed. However, I can give some pointers if you really want to do this now.
The partial derivative of a function with respect to a variable is obtained by "pretending" that the other variables are constants and taking the derivative normally; for example, the partial derivative with respect to $x$ of $x^3 + 2xy + \cos(y)$ is $3x^2 + 2y + 0$, because as far as $x$ is concerned $y$ is constant.
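A handy way to sanity-check a hand-computed partial derivative is a finite difference: hold the other variable fixed and nudge the one you're differentiating. A sketch, using the example function above:

```python
import math

def f(x, y):
    return x**3 + 2*x*y + math.cos(y)

def partial_x(f, x0, y0, h=1e-6):
    """Approximate the partial derivative of f with respect to x at (x0, y0)
    by a central finite difference, holding y fixed."""
    return (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)

x0, y0 = 2.0, 1.0
approx = partial_x(f, x0, y0)
exact = 3 * x0**2 + 2 * y0  # the formula derived above: 3x^2 + 2y
print(approx, exact)  # both close to 14.0
```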
When you have a function of many variables, $f(x, y, z, \ldots)$, the gradient is the vector whose components are its partial derivatives. So, for example, the gradient of $x^3 + 2xy + \cos(y)$ is the vector $\langle 3x^2 + 2y, 2x - \sin(y)\rangle$. The cool thing about the gradient is that it points in the direction of greatest increase - so, for example, at the point $(0, 1)$ the gradient above is the vector $\langle 2, -\sin(1)\rangle$, so the function $f(x,y) = x^3 + 2xy + \cos(y)$ increases the fastest in that direction (which happens to be south-east-ish).
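You can verify the "direction of greatest increase" claim numerically: a small step along the gradient raises $f$ more than an equal-length step in any other direction. A sketch with the example above:

```python
import math

def f(x, y):
    return x**3 + 2*x*y + math.cos(y)

def grad(x, y):
    # the partial derivatives computed above
    return (3*x**2 + 2*y, 2*x - math.sin(y))

gx, gy = grad(0.0, 1.0)
print(gx, gy)  # (2.0, -sin(1) ≈ -0.841): points "south-east-ish"

# A small step along the (normalized) gradient increases f more than
# an equal-length step in, say, the +x direction alone:
step = 1e-3
norm = math.hypot(gx, gy)
along_grad = f(0.0 + step * gx / norm, 1.0 + step * gy / norm)
along_x = f(0.0 + step, 1.0)
print(along_grad > along_x)  # True
```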
So the idea of neural network gradient descent is that you take your cost function as a function of all of the parameters of your net - that is, as one big expression in terms of all of the weights - and take the gradient. The result will be a vector with one component for each weight. What you then do is multiply this vector by a small amount ($0.001$, say) and subtract each component from the corresponding weight (remember, the gradient points in the direction of greatest increase, and you want to decrease your cost function, so you have to go the other way). Note that global minima - or indeed optimization of any kind - have nothing to do with it. Note also that we're not paying attention to which connections share base neurons or anything else like that - we're changing the whole net at once.
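Here is the update rule in miniature, on a toy cost function rather than a real net (the cost and weight values are made up for illustration, and the gradient is computed numerically so no calculus machinery is needed):

```python
def cost(w):
    # a toy cost function of two "weights", minimized at w = [1.0, -2.0]
    return (w[0] - 1.0)**2 + (w[1] + 2.0)**2

def numerical_gradient(f, w, h=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    grad = []
    for i in range(len(w)):
        wp = list(w); wp[i] += h
        wm = list(w); wm[i] -= h
        grad.append((f(wp) - f(wm)) / (2 * h))
    return grad

w = [0.0, 0.0]
lr = 0.1  # the "small amount" the gradient gets multiplied by
for _ in range(100):
    g = numerical_gradient(cost, w)
    # subtract: step *against* the gradient, since it points uphill
    w = [wi - lr * gi for wi, gi in zip(w, g)]

print(w)  # approaches the minimum at [1.0, -2.0]
```

Note that every weight is updated in the same pass; the gradient has one component per weight, and each component only tells you how to nudge that one weight.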
I hope what I've laid out will be helpful, but I imagine it'll be very difficult. I understand the desire to just target a particular goal and try to learn exactly the things required for that, but sometimes you just have to do the foundations first. This is probably one of those times - I really recommend getting a thorough understanding of partial derivatives before you try to tackle this. If you're learning it on your own, look up the terms partial derivative and gradient vector; that should be enough information to work with.