I have just finished watching a lecture from Patrick Winston's AI course. In an attempt to understand the mathematics behind back-propagation, I have formulated a simplistic neural network as follows:
Task
Given an input $x \in \{0, 1\}$, train a simple neural network to mimic the identity function, that is $f(x) = x$.
Definitions
$x$ = the input
$T$ = the target function; in this case $T(x) = x$
$y$ = the desired output; $y = T(x)$
$A$ = the activation function; I'll use $A(x) = \frac{1}{1 + e^{-x}}$
$\hat{T}$ = the feed-forward function
$\hat{y}$ = the output of the network; $\hat{y} = \hat{T}(x)$
$E$ = the error function; I'll use $E(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$
$w$ = a weight $\in [0, 1]$
$\alpha$ = the learning rate
Note that $\hat{T}(x) = A(wx)$ in this very simplistic example.
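To make the setup concrete, here is a minimal Python sketch of this one-weight network (the function names `A` and `T_hat` just mirror the symbols above; they are not from any library):

```python
import math

def A(z):
    # activation: the logistic function A(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

def T_hat(w, x):
    # feed-forward function of the single-weight network: T_hat(x) = A(w * x)
    return A(w * x)

# With w = 0 the network outputs A(0) = 0.5 regardless of x
print(T_hat(0.0, 1.0))  # 0.5
```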
Formulating back-propagation via gradient descent
In a particular iteration $i$ of this network, let $x_i \in \{0, 1\}$ and $\hat{y}_i = \hat{T}(x_i)$. Now, the weight $w_i$ of this network needs to be adjusted such that $E(\hat{y}_i, y_i) > E(\hat{y}_{i+1}, y_{i+1})$, and $\frac{\partial{E}}{\partial{w_i}}$ gives the rate of change of $E$ with respect to $w_i$. Computing this partial (note that $y_i = T(x_i)$ does not depend on $w_i$, so $\frac{\partial{y_i}}{\partial{w_i}} = 0$):
\begin{align}
\tag{1}\frac{\partial{E}}{\partial{w_i}} &= \frac{\partial{E}}{\partial{\hat{y}_i}} \frac{\partial{\hat{y}_i}}{\partial{w_i}} + \frac{\partial{E}}{\partial{y_i}} \frac{\partial{y_i}}{\partial{w_i}}\\
\tag{2}&= (\hat{y}_i - y_i) \cdot \frac{\partial{\hat{y}_i}}{\partial{w_i}} - (\hat{y}_i - y_i) \cdot 0\\
\tag{3}&= (\hat{y}_i - y_i) \cdot \frac{\partial{\hat{y}_i}}{\partial{w_i}}\\
\tag{4}&= (\hat{y}_i - y_i) \cdot \frac{\partial{\hat{T}}}{\partial{w_i}}\\
\tag{5}&= (\hat{y}_i - y_i) \cdot \frac{\partial{A(w_i x_i)}}{\partial{w_i}}\\
\tag{6}&= (\hat{y}_i - y_i) \cdot x_i \cdot (1 - A(w_i x_i)) \cdot A(w_i x_i)
\end{align}
The confusion
How does finding $\frac{\partial{E}}{\partial{w_i}}$ help in the gradient descent algorithm in this case? I only understand that the gradient of a function at a point returns a vector pointing in the direction of the greatest incline. How can the weight $w_i$ be updated in such a manner?

Once you have found the gradient $\frac{\partial{E}}{\partial{w_i}}$, you change the weight as follows:
$$w_i^{\text{new}} = w_i^{\text{old}} - \mu \cdot \frac{\partial{E}}{\partial{w_i}}$$
with $\mu$ being a small positive number: the learning rate ($\alpha$ in your notation).
So: if $\frac{\partial{E}}{\partial{w_i}}$ is positive, then increasing the weight will increase the error, so you want to move the weight in the opposite direction, i.e. subtract a little bit from the weight (and if $\frac{\partial{E}}{\partial{w_i}}$ is negative, then this formula will add a little bit to the weight).
We of course do not know how much to change the weight, so we only change it a little bit (the learning rate $\mu$ is usually set to be fairly small). So, we 'nudge' all weights in the 'right' direction (the direction that should decrease the error), and then we repeat the process.
Finally, we change each weight proportionally to $\frac{\partial{E}}{\partial{w_i}}$, because that follows the steepest descent in the 'error landscape'. Another way of thinking about it: we could change all weights by the same (small) amount in the right direction (up or down, again depending on the sign of the derivative), but because this is such a non-linear system, changing a weight may help for certain input-output patterns but hurt for others. So we want to change as little as possible in terms of 'total change': if we find a large $\frac{\partial{E}}{\partial{w_i}}$, we know that changing that weight will have a large effect in decreasing the error (at least locally), so we'd rather change that weight than one that would have far less effect on the error.