When I ask "What is the derivative?", the answer I get most often (and the one I find most satisfying) is: $$ \text{The derivative at a point is a local, linear approximation of the function at that point.} \quad(*)$$
My question is how does one get from this (somewhat imprecise) statement $(*)$ to the mathematical interpretation. Specifically, for functions $f: \mathbb{R} \to \mathbb{R}$ we have $$f'(x) = \lim_{h \to 0}\frac{f(x+h) - f(x)}{h} \tag{1}$$ wherever it exists. For functions $f : \mathbb{R}^n \to \mathbb{R}^m$ we have $$f'(x) \in \text{Hom}(\mathbb{R}^n, \mathbb{R}^m): \lim_{h \to 0} \frac{|f(x+h) - f(x) - f'(x)h|}{|h|} = 0 \tag{2}$$ wherever it exists.
I will now describe my own answer to this question for functions $\mathbb{R} \to \mathbb{R}$:
Suppose we have a function $f: \mathbb{R} \to \mathbb{R}$ and we wish to find a local, linear approximation to the function at $(x_0, f(x_0))$. For functions $\mathbb{R} \to \mathbb{R}$ a linear approximation will be of the form $y = mx + b$ (a straight line). Of course, for this line to be a good approximation, it must pass through $(x_0, f(x_0))$. Now we must answer the question "Of all the straight lines passing through the point, which is the best local approximation of the function at that point?". If we can draw a tangent to the curve at $(x_0, f(x_0))$, it is not hard to convince yourself that the tangent gives the best local approximation. If we can't draw a tangent to the curve at that point then it is not clear which line gives the best approximation, so we can make a convention and say the derivative doesn't exist at this point. One can easily check that (1) gives the gradient of the tangent iff the tangent exists, so we are done.
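As a quick numerical sanity check of the claim that (1) recovers the gradient of the tangent, here is a minimal sketch. The choice of $f = \sin$ and $x_0 = 0.5$ is purely an illustrative assumption (not part of the argument above), as is the helper name `difference_quotient`; the known tangent slope there is $\cos(0.5)$.

```python
import math

def difference_quotient(f, x0, h):
    """Slope of the secant line through (x0, f(x0)) and (x0 + h, f(x0 + h))."""
    return (f(x0 + h) - f(x0)) / h

# Illustrative assumption: f = sin at x0 = 0.5, whose tangent slope is cos(0.5).
x0 = 0.5
slopes = [difference_quotient(math.sin, x0, 10.0 ** -k) for k in range(1, 7)]
errors = [abs(s - math.cos(x0)) for s in slopes]

for k, (s, e) in enumerate(zip(slopes, errors), start=1):
    print(f"h = 1e-{k}: secant slope = {s:.8f}, distance from cos(x0) = {e:.2e}")
```

The secant slopes should settle towards the tangent slope as $h$ shrinks, which is all that (1) asserts in this case.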
I am happy with this argument for functions $\mathbb{R} \to \mathbb{R}$. What I am looking for is an argument for functions $\mathbb{R}^n \to \mathbb{R}^m$. The usual argument in the books I have read appears to go as follows:
Notice that $(1)$ can be written as $$f(x+h) - f(x) = f'(x)h + r(h)$$ where $r(h)$ is such that $$\lim_{h \to 0} \frac{r(h)}{h} = 0$$ Now we have expressed the change in $f$ as a linear function plus a small error term. This naturally generalises to $$f(x+h) - f(x) = f'(x)h + r(h)$$ where $f : \mathbb{R}^n \to \mathbb{R}^m$ with $f'(x)$ a linear map $\mathbb{R}^n \to \mathbb{R}^m$ and $h$ a vector, so $f'(x)h$ is the linear map acting on the vector. We similarly require that $$\lim_{h \to 0} \frac{r(h)}{|h|} = 0$$ This can easily be seen to be equivalent to (2).
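To make the generalised condition concrete, here is a small sketch that evaluates $|r(h)|/|h|$ for a sample map $\mathbb{R}^2 \to \mathbb{R}^2$. The map $f(x, y) = (x^2 y, \sin x + y)$, the base point, the direction of $h$, and all function names are assumptions chosen only for illustration; the Jacobian is written out by hand and plays the role of $f'(x)$.

```python
import math

def f(p):
    """Illustrative map R^2 -> R^2 (an assumption, not from the post)."""
    x, y = p
    return (x * x * y, math.sin(x) + y)

def jacobian(p):
    """Hand-computed Jacobian of f at p, standing in for the linear map f'(p)."""
    x, y = p
    return ((2 * x * y, x * x),
            (math.cos(x), 1.0))

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def remainder_ratio(p, h):
    """|f(p + h) - f(p) - f'(p) h| / |h|, i.e. |r(h)| / |h|."""
    J = jacobian(p)
    fp = f(p)
    fph = f((p[0] + h[0], p[1] + h[1]))
    Jh = (J[0][0] * h[0] + J[0][1] * h[1],
          J[1][0] * h[0] + J[1][1] * h[1])
    r = (fph[0] - fp[0] - Jh[0], fph[1] - fp[1] - Jh[1])
    return norm(r) / norm(h)

p = (1.0, 2.0)
direction = (0.6, -0.8)  # unit vector; h = t * direction with t -> 0
ratios = [remainder_ratio(p, (t * direction[0], t * direction[1]))
          for t in (1e-1, 1e-2, 1e-3, 1e-4)]

for t, ratio in zip((1e-1, 1e-2, 1e-3, 1e-4), ratios):
    print(f"|h| = {t:.0e}: |r(h)|/|h| = {ratio:.3e}")
```

For this smooth example the ratio should shrink roughly in proportion to $|h|$. A single direction is only a spot check, since the limit in (2) is over all ways of sending $h \to 0$.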
The issue I have with this (and the subject of this post) is: We are saying that $(*)$ is equivalent to $$f(x+h) - f(x) = f'(x)h + r(h)$$ for a linear map $f'(x)$ and $r(h)$ a function such that $$\lim_{h \to 0} \frac{r(h)}{|h|} = 0\tag{3}$$ Specifically, the word "linear" in $(*)$ becomes the fact that $f'(x)$ is linear, and the words "local approximation" become $r(h)$ such that (3) holds.
My question is why it is reasonable to represent the fact that we have a local approximation by the condition that $f(x+h)-f(x)$ differs from $f'(x)h$ by a function $r(h)$ such that (3) holds. I don't see why you would choose that notion of "approximates". The statement can be interpreted as saying that $$f(x+h) - f(x) - f'(x)h\tag{4}$$ goes to zero "faster" than $|h|$ does, but what if we require that (4) goes to zero faster than some other function? How about faster than $\sqrt{|h|}$ or $|h|^2$ or $|h|^3$? How about faster than some other exotic function?
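One way to probe the alternatives is to try them on a simple case. The sketch below assumes $f(x) = x^2$ at $x_0 = 1$ (my own illustrative choice) and a candidate slope $m$, so that (4) becomes $(2 - m)h + h^2$, and then divides by $|h|^p$ for $p = \tfrac12, 1, 2$. The function name `scaled_remainder` is hypothetical.

```python
def scaled_remainder(m, h, p):
    """|f(x0 + h) - f(x0) - m*h| / |h|**p for f(x) = x**2 at x0 = 1.

    Algebraically the numerator is |(2 - m)*h + h**2|.
    """
    x0 = 1.0
    r = (x0 + h) ** 2 - x0 ** 2 - m * h
    return abs(r) / abs(h) ** p

# Compare three candidate slopes under three notions of "faster than".
for p in (0.5, 1.0, 2.0):
    for m in (0.0, 2.0, 3.0):
        values = [scaled_remainder(m, h, p) for h in (1e-2, 1e-4, 1e-6)]
        formatted = ", ".join(f"{v:.3e}" for v in values)
        print(f"p = {p}, m = {m}: {formatted}")
```

From the algebra, with $p = \tfrac12$ the quotient tends to zero for every $m$, so the condition cannot single out a slope. With $p = 1$ it tends to $|2 - m|$, which vanishes only for $m = 2$. With $p = 2$ it tends to $1$ when $m = 2$ and blows up otherwise, so no slope qualifies in this example.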
Additionally I'm not convinced that requiring that (4) goes to zero faster than some function is the only way (or the best way) to represent the notion of a "local approximation". Are there any other reasonable ways of interpreting the notion of "local approximation"?
Suppose $f$ is already linear: $f(x) = Ax$. You would like $f'(x) = A$.
Then any number of functions $f'(x)$ satisfy your definition, e.g. $f'(x) = 0$.
Then your $f'(x)$ does not exist.
It's not really just "some function" though, is it? This $h$ is supposed to be the $dx$ in "$df/dx$". Unfortunately you can't divide by the vector $h$ so dividing by the length is somehow the best you can do.