I watched the popular 3Blue1Brown video about how neural networks work (https://youtu.be/IHZwWFHWa-w), and one of the most mysterious parts for me was the explanation of gradient descent.
As far as I can see, gradient descent resembles walking down a curved surface (at 9:57 in the video). But I really can't understand how such a seemingly linear system as a neural network can compose such a complex structure. To me, the loss surface should look like a... plane? Maybe one with a single basin, but no more.
Here is the partial derivative of $C$ with respect to one weight (from 5:44 of the video) $$\dfrac{\partial C}{\partial w}(w),$$ and I just can't understand how changing one weight can have such a non-linear impact.
For context, I am fairly comfortable with linear algebra and basic calculus, and I even implemented my own perceptron a while back. But I feel like some basic concept is slipping away from me. I'm sorry to ask for somebody else's explanation, but I think the 3Blue1Brown videos are quite vital.
I'm absolutely not hoping for an immediate answer. Please tell me what I need to add to my question so it is clear what I'm asking for, despite the chaos in my head.
Edit: It still doesn't make sense to me how changing one weight, which simply multiplies a constant value, could make the loss function wiggle.
Let's call the neural network $f^{\theta}$: it takes an input $x$ (such as an image) from some space $\mathcal X$ and returns an output $y$ (such as "cat" or "dog") from some space $\mathcal Y$. Here $\theta$ denotes the parameters of the neural network, for instance the weights of a perceptron.
In supervised learning you have a training data set $(x_1,y_1),\dots, (x_n, y_n)$ for some $n\in\mathbb N$. You also have some loss function $\mathcal L:\mathcal Y^2\to\mathbb R$ that judges how good your output is. (In a simple linear regression task, one might for instance choose $\mathcal Y=\mathbb R$ and $\mathcal L:\mathbb R^2\to\mathbb R$, $\mathcal L(a,b)=(a-b)^2$ (cf. Gauss-Markov Theorem).)
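To make this concrete, here is a minimal sketch (my own illustrative code, not from the video or the question): a one-hidden-layer network with a sigmoid activation playing the role of $f^\theta$, together with the squared loss $\mathcal L(a,b)=(a-b)^2$. The names `f`, `loss`, and the tuple layout of `theta` are my own choices.

```python
import numpy as np

def sigmoid(z):
    # Element-wise non-linear activation
    return 1.0 / (1.0 + np.exp(-z))

def f(theta, x):
    """A tiny one-hidden-layer network; theta = (W1, b1, W2, b2)."""
    W1, b1, W2, b2 = theta
    h = sigmoid(W1 @ x + b1)   # hidden layer (the non-linearity lives here)
    return W2 @ h + b2         # scalar output

def loss(y_true, y_pred):
    # Squared loss L(a, b) = (a - b)^2
    return (y_true - y_pred) ** 2

# Example: evaluate the loss on one training pair (x, y)
theta = (np.array([[1.0], [-2.0]]), np.zeros(2), np.array([1.0, 1.0]), 0.0)
x, y = np.array([0.5]), 1.0
print(loss(y, f(theta, x)))
```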
Your goal is to minimize over $\theta$ the sum
$$\sum_{k=1}^n \mathcal L(y_k, f^\theta(x_k)).$$
Note that this sum is usually highly non-linear in $\theta$, even if each $f^\theta$ is linear in the $x_k$, simply because the training data itself may have (and often does have) a complicated, non-linear structure.
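Here is a small numerical illustration of this point (my own sketch, with a made-up toy network and data set): freeze every parameter of a tiny sigmoid network except one hidden weight $w$, and evaluate the summed squared loss as $w$ varies. If the loss were linear in $w$, its second finite differences on a uniform grid would vanish; if it were a simple parabola, its third differences would. Neither happens, so even this 1-D slice of the loss surface is genuinely curved and wiggly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_loss(w, data):
    """Summed squared loss of a 1-input, 2-hidden-unit, 1-output network.
    Every parameter is fixed except the single hidden weight w."""
    s = 0.0
    for x, y in data:
        h = sigmoid(np.array([w * x, -3.0 * x + 1.0]))  # hidden layer
        pred = 2.0 * h[0] - 2.0 * h[1]                  # output layer
        s += (y - pred) ** 2                            # squared loss
    return s

# Toy training set with a non-linear pattern (entirely made up).
data = [(-2.0, 0.5), (-1.0, -0.5), (0.0, 0.0), (1.0, -0.5), (2.0, 0.5)]

ws = np.linspace(-6.0, 6.0, 13)
L = np.array([total_loss(w, data) for w in ws])

# Non-vanishing second differences: not linear in w.
# Non-vanishing third differences: not even a parabola.
second = np.diff(L, 2)
third = np.diff(L, 3)
print(np.max(np.abs(second)), np.max(np.abs(third)))
```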
Furthermore, neural networks are usually able to approximate even highly non-linear functions: in fact, by the universal approximation theorem, any continuous function on a compact set can be uniformly approximated to arbitrary precision by a sufficiently wide neural network.
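As a tiny hand-built illustration of this expressiveness (my own sketch, not part of the theorem's proof): a hidden layer of width 2 with ReLU activations already represents the non-linear function $|x|$ exactly, since $\operatorname{relu}(x) + \operatorname{relu}(-x) = |x|$. The weights here are chosen by hand, not trained.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(x):
    # One hidden layer of width 2 with ReLU activation:
    # hidden weights [1, -1], output weights [1, 1], all biases zero.
    h = relu(np.array([x, -x]))
    return float(h[0] + h[1])   # equals |x| exactly

xs = np.linspace(-3.0, 3.0, 7)
print([net(x) for x in xs])  # → [3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0]
```

The linear pieces come from the affine maps; the kinks, and hence all the non-linearity, come from the activation. Stacking and widening such layers is what lets the loss surface over $\theta$ become the complicated landscape shown in the video.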