How is the derivative truly, literally the "best linear approximation" near a point?


I've read many times that the derivative of a function $f(x)$ for a certain $x$ is the best linear approximation of the function for values near $x$.

I always thought it was meant in a hand-waving approximate way, but I've recently read that:

"Some people call the derivative the “best linear approximator” because of how accurate this approximation is for $x$ near $0$ (as seen in the picture below). In fact, the derivative actually is the “best” in this sense – you can’t do better." (from http://davidlowryduda.com/?p=1520, where $0$ is a special case in the context of Taylor Series).

This seems to make it clear that the idea of "best linear approximation" is meant in a literal, mathematically rigorous way.

I'm confused because I believe that for a differentiable function, no matter how small you make the interval $\epsilon$ around $x$, for any $a$ near $x$ in that interval there will always be a line through $x$ that is either as good an approximation of $f(a)$ as the one given by $f'(x)$ (in case the function is actually linear over that interval), or a better one (the case in which the line through $(x, f(x))$ also passes through $(a, f(a))$, and any line between this line and the tangent at $x$).

What am I missing?

There are 10 answers below.

Accepted answer (score 10):

As some people on this site might be aware I don't always take downvotes well. So here's my attempt to provide more context to my answer for whoever decided to downvote.

Note that I will confine my discussion to functions $f: D\subseteq \Bbb R \to \Bbb R$ and to ideas that should be simple enough for anyone who's taken a course in scalar calculus to understand. Let me know if I haven't succeeded in some way.


First, it'll be convenient for us to define a new notation. It's called "little oh" notation.

Definition: A function $f$ is called little oh of $g$ as $x\to a$, denoted $f\in o(g)$ as $x\to a$, if

$$\lim_{x\to a}\frac {f(x)}{g(x)}=0$$

Intuitively this means that $f(x)\to 0$ as $x\to a$ "faster" than $g$ does.

Here are some examples:

  • $x\in o(1)$ as $x\to 0$
  • $x^2 \in o(x)$ as $x\to 0$
  • $x\in o(x^2)$ as $x\to \infty$
  • $x-\sin(x)\in o(x)$ as $x\to 0$
  • $x-\sin(x)\in o(x^2)$ as $x\to 0$
  • $x-\sin(x)\not\in o(x^3)$ as $x\to 0$
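These limits are easy to check numerically. A minimal Python sketch (my own illustration, not part of the answer) for the $x-\sin(x)$ examples:

```python
import math

# Numeric sanity check for two of the examples above:
# x - sin(x) behaves like x^3/6 near 0, so the ratio against x^2
# tends to 0 (little oh), while the ratio against x^3 tends to
# 1/6, not 0 (so not little oh of x^3).
def f(x):
    return x - math.sin(x)

for x in (1e-1, 1e-2, 1e-3):
    print(x, f(x) / x**2, f(x) / x**3)
```

As $x$ shrinks, the first ratio visibly heads to $0$ while the second settles near $1/6$.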

Now what is an affine approximation? (Note: I prefer to call it affine rather than linear -- if you've taken linear algebra then you'll know why.) It is simply a function $T(x) = A + Bx$ that approximates the function in question.

Intuitively it should be clear which affine function should best approximate the function $f$ very near $a$. It should be $$L(x) = f(a) + f'(a)(x-a).$$ Why? Well consider that any affine function really only carries two pieces of information: slope and some point on the line. The function $L$ as I've defined it has the properties $L(a)=f(a)$ and $L'(a)=f'(a)$. Thus $L$ is the unique line which passes through the point $(a,f(a))$ and has the slope $f'(a)$.

But we can be a little more rigorous. Below I give a lemma and a theorem that tell us that $L(x) = f(a) + f'(a)(x-a)$ is the best affine approximation of the function $f$ at $a$.

Lemma: If a differentiable function $f$ can be written, for all $x$ in some neighborhood of $a$, as $$f(x) = A + B\cdot(x-a) + R(x-a)$$ where $A, B$ are constants and $R\in o(x-a)$, then $A=f(a)$ and $B=f'(a)$.

Proof: First notice that because $f$, $A$, and $B\cdot(x-a)$ are continuous at $x=a$, $R$ must be too. Then setting $x=a$ we immediately see that $f(a)=A$.

Then, rearranging the equation we get (for all $x\ne a$)

$$\frac{f(x)-f(a)}{x-a} = \frac{f(x)-A}{x-a} = \frac{B\cdot (x-a)+R(x-a)}{x-a} = B + \frac{R(x-a)}{x-a}$$

Then taking the limit as $x\to a$ we see that $B=f'(a)$. $\ \ \ \square$

Theorem: A function $f$ is differentiable at $a$ iff, for all $x$ in some neighborhood of $a$, $f(x)$ can be written as $$f(x) = f(a) + B\cdot(x-a) + R(x-a)$$ where $B \in \Bbb R$ and $R\in o(x-a)$.

Proof: "$\implies$": If $f$ is differentiable then $f'(a) = \lim_{x\to a} \frac{f(x)-f(a)}{x-a}$ exists. This can alternatively be written $$f'(a) = \frac{f(x)-f(a)}{x-a} + r(x-a)$$ where the "remainder function" $r$ has the property $\lim_{x \to a} r(x-a)=0$. Rearranging this equation we get $$f(x) = f(a) + f'(a)(x-a) -r(x-a)(x-a).$$ Let $R(x-a):= -r(x-a)(x-a)$. Then clearly $R\in o(x-a)$ (confirm this for yourself). So $$f(x) = f(a) + f'(a)(x-a) + R(x-a)$$ as required.

"$\impliedby$": Simple rearrangement of this equation yields

$$B + \frac{R(x-a)}{x-a}= \frac{f(x)-f(a)}{x-a}.$$ The limit as $x\to a$ of the LHS exists and thus the limit also exists for the RHS. This implies $f$ is differentiable by the standard definition of differentiability. $\ \ \ \square$


Taken together, the above lemma and theorem tell us not only that $L(x) = f(a) + f'(a)(x-a)$ is the only affine function whose remainder tends to $0$ as $x\to a$ faster than $x-a$ itself (this is the sense in which the approximation is best), but also that we can define the very concept of differentiability by the existence of this best affine approximation.
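As a concrete check of the theorem (a Python sketch under my own choice of example, not from the answer): for $f = \exp$ at $a = 1$ the remainder is $R(h) = e^{1+h} - e - e\,h$, and $R(h)/h$ should vanish as $h \to 0$.

```python
import math

# Remainder of the best affine approximation of f = exp at a = 1:
#   R(h) = exp(1+h) - e - e*h.
# The theorem says R(h)/h -> 0 as h -> 0.
a = 1.0
for h in (1e-1, 1e-2, 1e-3):
    R = math.exp(a + h) - math.exp(a) - math.exp(a) * h
    print(h, R / h)  # shrinks roughly like e*h/2
```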

Answer (score 1):

Think about the derivative this way: if you zoom in very close to any differentiable (smooth) curve, it looks like a straight line. The slope of that line is the derivative, and it is the best linear approximation to the function near that point. If some other linear approximation fit better when zoomed in that closely, then by definition its slope would be closer to the slope of the function at that point than the derivative is, which is impossible.

Answer (score 2):

I'll first give an intuitive answer, then an analytic one.

Intuitively, the tangent goes in the same direction as the function, following it as closely as possible for a line. Any other line immediately starts to diverge from the function.

Analytically:

Consider the Taylor expansion at $x$: $f(x+h) = f(x)+hf'(x)+h^2f''(x)/2+\cdots$

This means that, for small $h$, $f(x+h) \approx f(x)+hf'(x)+h^2f''(x)/2$, so the error $E(x, h) = f(x+h)- (f(x)+hf'(x))$ is about $h^2f''(x)/2$.

Now consider any other line through $(x, f(x))$ with slope $s \ne f'(x)$. At $x+h$, its value is $f(x)+sh$, so its error is $e(x, h, s) = f(x+h)-(f(x)+sh)$.

Since $f(x+h)-f(x) \approx hf'(x)+h^2f''(x)/2 $,

$\begin{array}\\ e(x, h, s) &=f(x+h)-(f(x)+sh)\\ &\approx hf'(x)+h^2f''(x)/2-sh\\ &= h(f'(x)-s)+h^2f''(x)/2\\ \end{array} $

so that $\dfrac{E(x, h)}{e(x, h, s)} \approx \dfrac{h^2f''(x)/2}{h(f'(x)-s)+h^2f''(x)/2} = \dfrac{hf''(x)/2}{f'(x)-s+hf''(x)/2} $.

Since $s \ne f'(x)$, as $h \to 0$ the numerator of this ratio of errors goes to zero, while the denominator stays bounded away from zero.

Therefore the error of the tangent goes to zero faster than the error in any other line through the point.

That is why the tangent is the best linear approximation to the curve.
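The shrinking error ratio $E/e$ can be seen numerically. A Python sketch (my own example, with $f = \sin$, $x = 1$, and an arbitrary competing slope $s = 0.3$):

```python
import math

# Tangent error E vs. the error e of a line with a different slope s,
# for f = sin at x = 1 (so f'(x) = cos(1), and s = 0.3 != cos(1)).
x, s = 1.0, 0.3
for h in (1e-1, 1e-2, 1e-3):
    E = math.sin(x + h) - (math.sin(x) + math.cos(x) * h)  # tangent error
    e = math.sin(x + h) - (math.sin(x) + s * h)            # other line's error
    print(h, E / e)  # ratio shrinks like a constant times h
```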

Answer (score 0):

I think you might be confusing the derivative as a linear operator vs. the derivative evaluated at a point (a linear functional). The derivative itself does not approximate anything, it just gives you a function that tells you the rate of change of the original function for every value of x in the domain. Now, when you evaluate the derivative at a single point $x=a$, you are still one step removed from an approximation of your original function $f$ in the small neighborhood of $a$. This is because evaluating the derivative only gives you the rate of change for the function in the small neighborhood of $a$. Then you have to perform an affine transformation (i.e. a translation) of that value to arrive at your approximation.

So, when you think derivative, think $D:C^{k} \to C^{k-1}$ which is given by

$$Df=\frac{df(x)}{dx}=f'(x)$$

for some $f \in C^{k}$ and then for the derivative of a function evaluated at a specific point, think about a linear functional $E:C^{k} \to \mathbb{R}$ given by $$E[f']_{a}=f'(a)$$

Answer (score 6):

There is a sense in which the derivative is the best linear approximation. You just have to define "best" approximation in a proper way, taking into account that the derivative is a very local property. In particular, suppose we are trying to approximate $f$ at $x_0$. Then, we make the following definition:

A function $g$ is at least as good of an approximation as $h$ if there is some $\varepsilon>0$ such that for any $x$ with $|x-x_0|<\varepsilon$ we have that $|g(x)-f(x)|\leq |h(x)-f(x)|$.

This is to say that, when we compare two functions, we only look at arbitrarily small neighborhoods of the point at which we are approximating. This defeats your strategy: if you take the tangent line and compare it to a secant line passing through $(a,f(a))$, the comparison will exclude $a$ from consideration once $\varepsilon$ is made small enough. Essentially, the important thing is that you fix $\varepsilon$ after you fix the two functions you want to compare. This relation is only a partial order (and not even quite that), so sometimes there is no best approximation.

However, we have two theorems:

  • $f$ is differentiable at $x_0$ if and only if there is a linear function $g$ which is at least as good of an approximation as any other linear $h$.

  • If $f$ is differentiable at $x_0$, then $g(x)=f(x_0)+(x-x_0)f'(x_0)$ is the best linear approximation of $f$.

meaning this definition is equivalent to the usual one. Interestingly, we get the condition of continuity at $x_0$ if we ask for the best constant approximation to exist.

Answer (score 0):

Take the function $f(x) = x^2$. At $x = 0$, the derivative gives you the approximation $g(x) = 0$. On every interval $[-a, +a]$ the constant function $g(x) = a^2/2$ gives a better approximation, with a maximum error of $a^2/2$ instead of $a^2$.

However, that approximation will be worse on any interval smaller than $[-a/\sqrt{2}, +a/\sqrt{2}]$. $g(x) = 0$ will beat any other approximation on any small enough interval.
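The competing maximum errors are easy to tabulate. A Python sketch (my own illustration; the grid and the helper `max_err` are assumptions of the sketch):

```python
# f(x) = x^2 on [-a, a], comparing the tangent-at-0 constant g = 0
# with the minimax constant g = a^2/2, sampled on a grid.
a = 1.0
xs = [i * a / 1000 for i in range(-1000, 1001)]

def max_err(c, pts):
    return max(abs(x * x - c) for x in pts)

print(max_err(0.0, xs), max_err(a * a / 2, xs))  # a^2 vs. a^2/2

# On a much smaller interval, the tangent constant wins:
small = [i * 0.1 / 1000 for i in range(-1000, 1001)]
print(max_err(0.0, small), max_err(a * a / 2, small))
```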

Answer (score 2):

This depends a lot on how we measure error. So we could turn the question around and ask: for what definition of error does the first-order Taylor approximation give the least error? You have already gotten good explanations from others about what happens as we take limits close to the point, so let me contribute something new. Say we want to find $p$ to minimize a norm of the difference of the functions.

$$\min_{p}\{\|f(x)-p(x)\|\}$$

However, this $\|\cdot\|$ can be defined in many ways! Some popular choices are the weighted $L^k$ norms:

$$ \|f(x)-p(x)\|_k = \sqrt[k]{\int_{-\infty}^\infty w(x)\left|f(x)-p(x)\right|^kdx}$$

We would get a solution close to gnasher729's answer for $f(x) = x^2$, for example, if we pick $w(x)$ to be a box function and let $k \to \infty$, which approximates the max norm (simply the maximum absolute value on an interval).

I wonder what choices of $w(x)$ and $k$ will give us the first order Taylor approximation as the solution!

In fact in engineering, how to measure the error in a useful way can often be one of the toughest considerations.

Answer (score 0):

Let's try to find the best fitting line to the parabola $$\text{$y = f(x) = x^2$ at the point $(1,1)$ of $f$.}$$ We require that $$f(1) = L(1).$$ So the line must look like $$L(x) = m(x-1) + 1 = mx - (m-1).$$ The difference between the two curves will be $$E(x) = f(x) - L(x) = x^2 - mx + (m - 1).$$ In order to emphasize that we are interested in the behaviour of $E(x)$ near $x = 1$ we consider the function $$E(1 + h) = (1+h)^2 - m(1+h) + (m - 1) = (2-m)h + h^2.$$

The term $(2-m)h \;$ is an $``\text{order of}\, h"\,$ error and is expressed as $O(h)$, pronounced big $O$ of $h$.

The term $h^2 \,$ is an $``\text{order of}\; h^2"$ error and is expressed as $O(h^2)$, pronounced big $O$ of $h^2$.

The basic idea is that, if $h$ is small, then $h^2$ is an order of magnitude smaller.

We see when $m \ne 2$ that the error is $O(h)$ and, if $m=2$, then the error is $O(h^2)$. It is in this sense that we say the line $L(x) = 2x - 1$ is the "best linear fit line" to $y = x^2$ at the point $(1,1)$.

Hence the best linear fit line, $y = mx + b$, to the curve $y = f(x)$ at the point $(x_0, f(x_0))$ must have these two properties:

  1. $f(x_0) = L(x_0)$
  2. $f(x_0 + h) = L(x_0 + h) + O(h^2)$

To have $L(x_0) = f(x_0)$, we need $L(x) = m(x - x_0) + f(x_0)$. If we define $m = f'(x_0)$, then we get $L(x) = f'(x_0)(x - x_0) + f(x_0)$. Hence conditions $(1.)$ and $(2.)$ can be combined to

  1. $f(x_0 + h) = f(x_0) + h f'(x_0) + O(h^2)$.

and that is the sense by which $L(x) = f'(x_0)(x - x_0) + f(x_0)$ is the best linear approximation to the curve $y = f(x)$ at the point $(x_0, f(x_0))$.
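The $O(h)$-versus-$O(h^2)$ distinction above is visible numerically: halving $h$ roughly halves an $O(h)$ error but quarters an $O(h^2)$ error. A small Python sketch of the worked example ($f(x) = x^2$ at $(1,1)$; the competing slope $m = 3$ is my own choice):

```python
# Error of L(x) = m(x-1) + 1 approximating f(x) = x^2 near x = 1.
# Algebraically E(1+h) = (2-m)h + h^2, so the error is O(h) for
# m != 2 but O(h^2) for m = 2 (the tangent).
def err(m, h):
    return (1 + h) ** 2 - (m * (1 + h) - (m - 1))

for h in (0.1, 0.05, 0.025):
    print(h, err(3, h), err(2, h))
```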

Answer (score 0):

Here's a simple explanation of what's wrong with your argument. You're not understanding what is meant by "near".

The claim isn't that

for a given small interval it is the best approximation.

But this is what you are arguing: given a $\delta>0$, it is true that you can find a line that gives a better (or at least as good) approximation on $(x_0-\delta,x_0+\delta)$. But what happens as $\delta$ shrinks? After all, what you call near, someone with a different perspective would call far. So maybe you found a good approximation on the solar-system scale, but I'm a geologist, so we need to find a good one on my planetary scale (yours fails now); then we talk to a microbiologist and my approximation is no good anymore (and of course a string theorist is next).

Really the claim is

you cannot find a better approximation near $x_0$

and here "near $x_0$" is a key part of the definition. We say approximation $A$ is better than approximation $B$ near $x_0$ if I can find a small enough $\delta$ such that for any $\epsilon<\delta$, approximation $A$ is always better than $B$ on $(x_0-\epsilon, x_0+\epsilon)$.

If you take any one interval and the approximation you described, you'll find that for a small enough interval that approximation is not as good as the tangent.
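This can be checked numerically. A Python sketch (my own example with $f = \exp$ at $x_0 = 0$; the names `worst` and `secant_slope` are assumptions of the sketch) compares the tangent with a secant through a nearby point as the interval shrinks:

```python
import math

# f = exp at x0 = 0. The tangent is L(x) = 1 + x; the secant through
# (0, 1) and (d, e^d) is exact at x = d, yet its worst-case error on
# [-eps, eps] shrinks only like eps, while the tangent's shrinks
# like eps^2.
f = math.exp
d = 0.1
secant_slope = (f(d) - f(0)) / d

def worst(slope, eps, n=1001):
    pts = [-eps + 2 * eps * i / (n - 1) for i in range(n)]
    return max(abs(f(x) - (1 + slope * x)) for x in pts)

for eps in (0.2, 0.05, 0.01):
    print(eps, worst(1.0, eps), worst(secant_slope, eps))
```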

Answer (score 0):

Let $y=f(x)$ be a (differentiable) function that we are trying to approximate around the point $P(a,f(a))$. The simplest way to approximate its behaviour around that point is by fitting a linear function $g(x)=mx+c$ to it. We can then define 'the best linear approximation' to $f$ as the function $g$ with the following property: $$ \lim_{x \to a}\frac{f(x)-g(x)}{x-a}=0 \, . $$ What this criterion tries to capture is the 'relative error' of $g$: if $x$ is very close to $a$, then $g(x)$ should be closer still to $f(x)$. Simple algebraic manipulation shows that the only function $g$ satisfying this property is the tangent at $P$.

Let $$ h(x)=\frac{f(x)-g(x)}{x-a} \, . $$ Then $f(x)-g(x)=h(x)(x-a)$, so $$ \lim_{x \to a}\bigl(f(x)-g(x)\bigr) = \lim_{x \to a}h(x) \cdot \lim_{x \to a}(x-a)=0 \, . $$ Therefore, $$ \lim_{x \to a}f(x)=\lim_{x \to a}g(x) \, , $$ which implies $f(a)=g(a)$ since both $f$ and $g$ are differentiable and hence continuous. Unsurprisingly, the 'best linear approximation' of a function around the point $x=a$ is exactly equal to the function at $x=a$. Using the point-slope form of the equation of a line, we find that $$ g(x) = m(x-a) + g(a) = m(x-a) + f(a) \, . $$

We are now tasked with proving that $m=f'(a)$. Luckily, this is not too difficult: \begin{align} & \lim_{x \to a}\frac{f(x)-g(x)}{x-a}=0 \\[4pt] \implies & \lim_{x \to a}\frac{f(x)-f(a)+f(a)-g(x)}{x-a}=0 \\[4pt] \implies & \lim_{x \to a}\frac{f(x)-f(a)}{x-a} + \lim_{x \to a}\frac{f(a)-g(x)}{x-a}=0 \\[4pt] \implies & \lim_{x \to a}\frac{f(x)-f(a)}{x-a} + \lim_{x \to a}\frac{g(a)-g(x)}{x-a}=0 \\[4pt] \implies & f'(a) - g'(a) = 0 \\[4pt] \implies & f'(a) = g'(a) \end{align} Since $g'(a)=m$, we find that $g$ must have the equation $$ g(x) = f'(a)(x-a) + f(a) \, . $$ But this is the equation of the tangent at $P$, and so, in this sense, the derivative gives the best linear approximation of $f(x)$ around a given point.