Why do differentiation rules work? What's the intuition behind them? (Not asking for proofs)

Differentiation rules have been bugging me ever since I took Basic Calculus. I thought I'd develop some intuitive understanding of them eventually, but so far all my other math courses (including Multivariable Calculus) take the rules for granted.

I know how to prove some of the rules. The problem is that algebra manipulation alone isn't quite convincing to me. Is there any possibility of understanding why the algebra happens to work that way? For example, why do the slopes of the tangent line to the parabola x^2 happen to be determined by 2x? Looking at it graphically, there's no way I could've told that.

Any sources covering this issue (books; internet sites; etc) would be very greatly appreciated. Thanks in advance.

4
On

The key intuition, first of all, is that the product of two tiny differences is negligible. You can intuit this just by doing computations:

$$3.000001 \cdot 2.0001 = 6.0003020001$$

If we are doing any sort of rounding of hand computations, we'd likely round away that $0.0000000001$ part. If you were doing computations to eight significant digits, a value $v$ is really a value roughly in the range $v\left(1 \pm 10^{-8}\right)$, and the error when you multiply $v_1$ by $v_2$ is almost entirely $2\cdot 10^{-8}|v_1v_2|$ (one $10^{-8}$ from each factor). The other part of the error, of order $10^{-16}|v_1v_2|$, is so tiny you'd probably ignore it.
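
To see the split concretely, here is a minimal Python sketch using exact fractions. The helper name `product_change_parts` is made up for illustration; it just separates the change in a product into the part that matters and the product of the two tiny differences.

```python
from fractions import Fraction

# Exact arithmetic, so floating-point rounding does not blur the tiny term.
a, da = Fraction(3), Fraction(1, 10**6)   # 3 and its small change 0.000001
b, db = Fraction(2), Fraction(1, 10**4)   # 2 and its small change 0.0001


def product_change_parts(a, da, b, db):
    """Split (a+da)(b+db) - ab into first-order and second-order parts."""
    first_order = a * db + b * da   # each tiny change times the other full value
    second_order = da * db          # the product of the two tiny changes
    return first_order, second_order


first, second = product_change_parts(a, da, b, db)
print("first-order part: ", float(first))
print("second-order part:", float(second))
```

The second-order part is many orders of magnitude smaller than the first-order part, which is the whole point.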

Case: $f(x)=x^2$

Now, consider a square with corners $(0,0), (0,x), (x,0), (x,x)$. Grow $x$ a little bit, and you see the area grows in proportion to the combined length of two of the edges, plus a tiny little square. That tiny square is negligible.

This is a little harder to visualize for $x^n$, but it actually works the same way when $n$ is a positive integer, by considering an $n$-dimensional hypercube.

This geometric reason is also why the circumference of a circle is equal to the derivative of its area – if you increase the radius a little, the area is increased by approximately that "little" times the circumference. So the derivative of $\pi r^2$ is the circumference of the circle, $2\pi r$.
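
A quick numerical check of the circle claim, as a sketch only (the function names `area` and `area_growth_rate` are my own): grow the radius by a small amount and compare the area gained per unit of radius with the circumference.

```python
import math


def area(r):
    """Area of a circle of radius r."""
    return math.pi * r * r


def area_growth_rate(r, dr):
    """Area gained per unit of radius when the radius grows from r to r + dr."""
    return (area(r + dr) - area(r)) / dr


r = 2.0
circumference = 2 * math.pi * r
for dr in (0.1, 0.001, 0.00001):
    # The gap between the two numbers is pi * dr, the thin leftover ring.
    print(dr, area_growth_rate(r, dr), circumference)
```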

It's also a way to understand the product rule. (Or, indeed, FOIL.)

Case: The chain rule

The chain rule is better seen by considering an odd-shaped tub. Let's say that when the volume of the water in the tub is $v$, the tub is filled to depth $h(v)$. Then assume that we have a hose that, between time $0$ and time $t$, has delivered a volume $v(t)$ of water.

At time $t$, at what rate is the height of the water increasing?

Well, we know that when the current volume is $v$, then the rate at which the height is increasing is $h'(v)$ times the rate the volume is increasing. And the rate the volume is increasing is $v'(t)$. So the rate the height is increasing is $h'(v(t)) \cdot v'(t)$.
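
Here is a small sketch of the tub argument in Python. The tub shape $h(v)=\sqrt{v}$ and the hose schedule $v(t)=3t+t^2$ are invented purely for illustration; any smooth choices would do. It compares the chain-rule answer with a direct numerical estimate of how fast the height rises.

```python
import math


def h(v):
    """Depth of water for volume v (a made-up tub shape)."""
    return math.sqrt(v)


def h_prime(v):
    """Rate of depth change per unit of volume for the made-up tub."""
    return 0.5 / math.sqrt(v)


def v(t):
    """Volume delivered by the hose up to time t (a made-up schedule)."""
    return 3.0 * t + t * t


def v_prime(t):
    """Rate at which the hose delivers volume at time t."""
    return 3.0 + 2.0 * t


def height_rate_numeric(t, dt=1e-6):
    """Estimate the rate of height increase from two nearby times."""
    return (h(v(t + dt)) - h(v(t - dt))) / (2 * dt)


def height_rate_chain_rule(t):
    """h'(v(t)) * v'(t), as in the tub argument."""
    return h_prime(v(t)) * v_prime(t)


print(height_rate_numeric(2.0), height_rate_chain_rule(2.0))
```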

Case: Inverse function

This is the one case where it is obvious from the graph. When you flip the coordinates of a Cartesian plane, a line of slope $m$ gets sent to a line of slope $1/m$. So if $f$ and $g$ are inverse functions, then the slope of $f$ at $(x,f(x))$ is the reciprocal of the slope of $g$ at $(f(x),x)=(f(x),g(f(x)))$. So $g'(f(x))=1/f'(x)$.
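
As a sanity check, here is a sketch with one concrete inverse pair, $f=\exp$ and $g=\ln$ (my choice of example, not part of the argument above). The helper `slope` is a hypothetical name for a simple two-point slope estimate.

```python
import math


def slope(func, x, dx=1e-6):
    """Estimate the slope of func at x from two nearby points."""
    return (func(x + dx) - func(x - dx)) / (2 * dx)


x = 1.5
slope_f = slope(math.exp, x)            # slope of f at (x, f(x))
slope_g = slope(math.log, math.exp(x))  # slope of g at (f(x), x)

# If the slopes are reciprocals, their product should be very close to 1.
print(slope_f, slope_g, slope_f * slope_g)
```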

$x^2$ revisited

Another way of dealing with $f(x)=x^2$ is thinking again of area, but thinking of it in terms of units. If we have a square whose side is $x$ centimeters, and we change the side by a small amount, $\Delta x$ centimeters, then the area is $x^2\,\mathrm{cm}^2$ and it changes by approximately $f(x+\Delta x)-f(x)\approx f'(x)\Delta x$ square centimeters.

On the other hand, if we measure the square in meters, it has side length $x/100$ meters and area $(x/100)^2$ square meters. The change in the side length is $(\Delta x)/100$ meters. So the expected area change is $f'(x/100)\cdot (\Delta x)/100$ square meters. But this difference should be the same, so $$f'(x)\Delta x = f'(x/100)\cdot\frac{\Delta x}{100}\cdot \left(100^2\, \text{cm}^2/\text{m}^2\right) = 100 f'(x/100)\,\Delta x$$

More generally, then, we see that $f'(ax)=af'(x)$ when $f(x)=x^2$ by changing units from centimeters to a unit that is $1/a$ centimeters.

So we see that $f'(x)$ is linear, although it doesn't explain why $f'(1)=2$.

If you do the same for $f(x)=x^n$, with a unit $\mu$ and another unit $\rho$ where $a\rho = \mu$, then you get that the change in volume when the side changes by $\Delta x\,\mu$ is $f'(x)\Delta x\,\mu^n$. It is also $f'(ax)\cdot a(\Delta x)\,\rho^n$. Since $\mu/\rho = a$, this means $f'(ax) =a^{n-1}f'(x)$.

Again, we still don't know why $f'(1)=n$, but we know $f'(x)=f'(1)x^{n-1}$.
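
The scaling relation $f'(ax)=a^{n-1}f'(x)$ is easy to test numerically. A rough sketch, with arbitrary sample values $n=3$, $a=2$, $x=1.5$ and hypothetical helper names `slope` and `power`:

```python
def slope(func, x, dx=1e-6):
    """Estimate the slope of func at x from two nearby points."""
    return (func(x + dx) - func(x - dx)) / (2 * dx)


def power(n):
    """Return the function x -> x**n."""
    return lambda x: x ** n


n, a, x = 3, 2.0, 1.5
lhs = slope(power(n), a * x)               # f'(ax)
rhs = a ** (n - 1) * slope(power(n), x)    # a^(n-1) * f'(x)
print(lhs, rhs)
```

Note that this only checks the scaling law; just like the units argument, it says nothing about why $f'(1)=n$.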

0
On

Okay, I'm not sure if this is what you are asking but this was my intuition when I was a calculus student:

A derivative is a formula for the rate of change at various points of the function. (Assuming the function doesn't jump about or veer sharply.)

The rate of change is the slope of a tangent line to the function at a point.

We find the slope of a line by taking two points and finding the fraction of "the rise over the run".

We don't know the slope of the tangent line, but if we take two points of the function we can find the slope of the line through them.

As these two points get really close together, so that they actually are the same point, that line will be the tangent. (Formally, this is ... mush. If they are the same point they aren't two points but one, and the "rise over the run" will be 0/0, but just before that point they'll be two points and that slope will be really, really, really close to the slope of the tangent line.)

So the slope of the tangent line is the rise over the run of these two close points... or in other words $\lim \frac{f(x) - f(y)}{x - y}$ as $x$ and $y$ get really close together.

Well, replace $y$ with $x + h$ and this is $\lim_{h\to 0} \frac{f(x+h) - f(x)}{h}$. The ol' definition for the derivative.

======

Or. What is the derivative of $x^2$ at $x$? That is how "fast" the function is growing at $x$. So how much bigger is $(x + h)^2$ than $x^2$? Well, $(x + h)^2 = x^2 + 2hx + h^2$. So the function has gotten $2hx + h^2$ bigger. The $h^2$ is negligible, so in essence it got $2hx$ bigger. How "long" did it take to get this much bigger? Well, it did it in $h$ units. So it got that much bigger at a rate of $2hx$ units per $h$ units, or simply $2x$.
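
If you want to watch the leftover shrink, here is a small sketch using exact fractions (the function name `growth_rate` is mine). For $x^2$ the rate over a step $h$ is exactly $2x + h$, so the only thing standing between it and $2x$ is $h$ itself.

```python
from fractions import Fraction


def growth_rate(x, h):
    """How much x**2 grows per unit step when x moves to x + h."""
    return ((x + h) ** 2 - x ** 2) / h


x = Fraction(3)
for k in (1, 3, 6):
    h = Fraction(1, 10 ** k)
    # Exactly 2x + h: the extra h is what the negligible h**2 contributes.
    print(h, growth_rate(x, h), growth_rate(x, h) - 2 * x)
```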

0
On

For the first hundred years or so, before people formalized differentiation and integration by using limits, the general intuition behind taking the derivative of $f(x)$ was, "Let's add a tiny increment to $x$ and see how much $f(x)$ changes."

The "tiny increment" was called $o$ (lower-case letter O), at least by some people.

For $f(x) = x^2$, for example, you could show that $$f(x + o) = (x + o)^2 = x^2 + 2xo + o^2 = f(x) + 2xo + o^2.$$ So the amount of "change" in $f(x)$ is $2xo + o^2$, which is $2x + o$ times the amount by which you changed $x$. And then the mathematicians would say that only the $2x$ part of $2x + o$ matters, since $o$ is "vanishingly" small.

I think for most of the differentiation rules developed back then (which may be all you'll see in the table of derivatives in an elementary calculus book), the intuition was to do the arithmetic. What they did not do was to encumber that arithmetic with all the extra mechanisms needed to establish a limit, as the standard-analysis approach does today.

On the other hand, the arithmetic usually went hand-in-hand with practical problems (usually in what we would consider physics or engineering) that people wanted to solve. People also tended to make a connection between arithmetic and geometry, so linking the function $f(x) = x^2$ to the area of a square of side $x$ would have been an obvious thing to do (and the visualization in Thomas Andrews's answer would have worked very well, I think).

For example, visualize a particle running along a circular track at a constant speed. In fact, make the circular track be the circle given by $x^2 + y^2 = 1$ in the Cartesian plane. (Putting everything into Cartesian coordinates was all the rage when calculus was young.) You can then see (by symmetry, or by other arguments) that the direction the particle is going is always perpendicular to the direction in which the particle lies from the center of the circle at that moment. So if the angle to the particle at that instant is $\theta$, the $y$-coordinate of the particle is $\sin\theta$, but the velocity vector is pointing in a direction $\frac\pi2$ radians "ahead" of $\theta$, and if we let $\theta$ increase at the rate of $1$ radian per unit of time the magnitude of the velocity is $1$, so the $y$-coordinate of the velocity is $\sin\left(\theta + \frac\pi2\right) = \cos\theta$, which is the derivative of $\sin\theta$ when $\theta$ is measured in radians.
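
The circular-track picture can be checked numerically. In this sketch (function names are my own), the rate of change of $\sin\theta$ is estimated from two nearby angles and compared with the "quarter turn ahead" value $\sin(\theta+\frac\pi2)$.

```python
import math


def sin_rate_numeric(theta, dtheta=1e-6):
    """Estimate how fast sin(theta) changes, from two nearby angles."""
    return (math.sin(theta + dtheta) - math.sin(theta - dtheta)) / (2 * dtheta)


def sin_rate_geometric(theta):
    """The velocity points a quarter turn ahead of the position."""
    return math.sin(theta + math.pi / 2)


for theta in (0.0, 0.7, 2.0):
    print(theta, sin_rate_numeric(theta), sin_rate_geometric(theta), math.cos(theta))
```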

0
On

It is wonderful that you are wondering about this. Many people just solve problems by applying formulas automatically without caring about "how" or the history of it all. In fact the history is very interesting too.

Now, your question is broad. For example the part "The problem is that algebra manipulation alone isn't quite convincing to me. Is there any possibility of understanding why the algebra happens to work that way?" is addressed in the principles behind Calculus. Most Calculus books spend some effort on the basics, but at the end of the day the final laws are what get used in practice. Math students (at least) get to study subjects such as Real Analysis and "Analysis of Complex Variables". Such subjects focus on the science behind those magical formulas. You are correct in finding that intuition is not good enough for all this stuff. While you could obtain books on "Analysis" and review such concepts, they are usually written for advanced learners and may not be easy to digest. A good Calculus book should cover the essential concepts well.

As for your point "why do the slopes of the tangent line to the parabola x^2 happen to be determined by 2x?" - A good discussion with pictures can be found here: finding the tangent of a parabola algebraically.

3
On

Some visual images:

  • For $\frac{d}{dx} x^k = kx^{k-1}$:
    • A square of side $x$ has one corner fixed at the origin. The square grows to the upper right, by an amount proportional to the length of the upper and right sides, whose combined lengths are $2x$.
    • A cube of side $x$ has one corner fixed at the origin. The cube grows to the back upper right, by an amount proportional to the areas of the upper, right, and back faces, whose combined areas are $3x^2$.
    • And so on$\ldots$
  • For $\frac{d}{dx} \sin x = \cos x, \frac{d}{dx} \cos x = -\sin x$: The values of sin and cos for any $x$ are represented by a point on the unit circle at argument (that is, angle) $x$. The direction of change, however, is the tangent to that point in the counter-clockwise direction, which equals the sin and cos of a point $\pi/2$ radians on. So $\frac{d}{dx} \sin x = \sin (x+\pi/2) = \cos x$, and $\frac{d}{dx} \cos x = \cos (x+\pi/2) = -\sin x$.
  • For $\frac{d}{dx} e^x = e^x$: I'd love to come up with a transferable visceral intuition for this. I'll have to give this more thought. ETA: Best I can come up with is the usual compound interest argument. Imagine I have $\$100$ invested in an account earning $100$ percent interest. (I said "imagine".) If it's compounded annually, then I just end up with $\$100$, times $1+1 = 2$, or $\$200$. If it's compounded semi-annually, then I end up with $\$100$, times $1+1/2 = 3/2$, times $3/2$ again, or $\$225$. If it's $k$ times a year, then it's $\$100$, times $1+1/k$, times $1+1/k$, etc., a total of $k$ times, or $\$100$, times $(1+1/k)^k$. The limit, as $k \to \infty$ is $\$100$, times $e$. So $\frac{dy}{dx} = y$ leads to $y(x+1) = e \times y(x)$, or $y(x) = Ce^x$.
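
The compound interest argument is easy to play with. A minimal sketch (the function name `compounded` is just for illustration) that compounds 100 percent interest $k$ times a year and watches the result approach $e$:

```python
import math


def compounded(k):
    """Growth factor after one year at 100 percent interest, compounded k times."""
    return (1 + 1 / k) ** k


for k in (1, 2, 12, 365, 10 ** 6):
    # 100 dollars grows to 100 * compounded(k); compare with 100 * e.
    print(k, 100 * compounded(k), 100 * math.e)
```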
2
On

Differentiation is the study of linear approximation. For example, $$ (x+\delta)^{2}=x^{2}+2x\delta + \delta^{2}. $$ The linear term has slope $2x$ at $x$, which is the coefficient of the term that is linear in $\delta$. The linear term is the derivative: $$ f(x+\delta) = f(x)+f'(x)\delta+\mbox{higher order $\delta$ terms} $$ So, for example, the derivative of $fg$ is obtained by finding the linear terms in \begin{align} (fg)(x+\delta) &=\{ f(x)+f'(x)\delta+\cdots\}\{ g(x)+g'(x)\delta+\cdots\} \\ & = f(x)g(x)+\{f(x)g'(x)+f'(x)g(x)\}\delta+\cdots \end{align} $$ \implies (fg)'(x)=f(x)g'(x)+f'(x)g(x). $$
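
One way to make "keep only the linear terms" mechanical is to carry a value together with its $\delta$ coefficient and drop $\delta^2$ whenever two such pairs are multiplied. The class below is my own illustrative sketch of that bookkeeping, not anything from a library.

```python
class Dual:
    """Represents value + slope * delta, with delta**2 treated as negligible."""

    def __init__(self, value, slope):
        self.value = value
        self.slope = slope

    def __mul__(self, other):
        # (a + b*delta)(c + d*delta) = ac + (ad + bc)*delta + bd*delta**2,
        # and the delta**2 term is dropped.
        return Dual(self.value * other.value,
                    self.value * other.slope + self.slope * other.value)


x = 3.0
f = Dual(x ** 2, 2 * x)   # f(x) = x^2 with its linear coefficient 2x
g = Dual(x + 1, 1.0)      # g(x) = x + 1 with its linear coefficient 1
fg = f * g
print(fg.value, fg.slope)  # fg.slope should match 3x^2 + 2x at x = 3
```

Here $fg = x^3 + x^2$, so the slope at $x=3$ should be $3\cdot 9 + 2\cdot 3 = 33$, which is what the product of the linear parts gives.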

1
On

The intuition for this bothered me for a while too when I first learned about it. The standard argument based on limits and thinking about small changes seemed very mechanical and lacking in insight.

Since the chain rule seems very intuitive to me, what finally satisfied me was the following argument (requires a very small amount of multivariable calc/linear algebra), $$\text{multidimensional chain rule} \implies (\text{derivative of }x^2) =2x.$$ Specifically, take the following functions $g:\mathbb{R}\rightarrow \mathbb{R}^2$, and $f:\mathbb{R}^2 \rightarrow \mathbb{R}$ such that $g$ lifts $x$ into 2 dimensions by making a copy of it, then $f$ brings it back down to one dimension by multiplying the two copies, \begin{align} g(x) &:= \begin{bmatrix}x \\ x\end{bmatrix}, \quad\quad g'(x) = \begin{bmatrix}1 \\ 1\end{bmatrix} \\ f(x,y) &:= x\cdot y, \quad\quad f'(x,y) = \begin{bmatrix}y & x\end{bmatrix} \end{align} The composition of these functions is the 1D function we want, $$f(g(x)) = x^2.$$ By the chain rule, the derivative of the composition is the composition of the derivatives, which is, $$(f \circ g)'(x) = f'(x,x) \circ g'(x) = \begin{bmatrix}x & x\end{bmatrix}\begin{bmatrix}1 \\ 1\end{bmatrix} = 2x.$$

The same technique (lifting to higher dimensions + chain rule) also explains the product rule in general.
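
Written out as code, the lift-then-multiply construction looks like this. It is only a sketch with plain tuples standing in for the row and column vectors; the function names mirror the $f$ and $g$ above, and `composite_derivative` is a name I made up.

```python
def g(x):
    """Lift x into two dimensions by copying it."""
    return (x, x)


def g_prime(x):
    """Derivative of g: the column vector [1, 1]."""
    return (1.0, 1.0)


def f(u, v):
    """Bring the pair back down by multiplying."""
    return u * v


def f_prime(u, v):
    """Derivative of f: the row vector [df/du, df/dv] = [v, u]."""
    return (v, u)


def composite_derivative(x):
    """Row vector f'(g(x)) times column vector g'(x)."""
    row = f_prime(*g(x))
    col = g_prime(x)
    return sum(r * c for r, c in zip(row, col))


print(f(*g(3.0)), composite_derivative(3.0))
```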

0
On

$$x^n = \underbrace{x\times x\times\cdots\times x}_{n\text{ factors}}$$

If you replace $x\longrightarrow x + dx$, and work out the product, then the term proportional to $dx$ will be $n x^{n-1}dx$ because if you pick a $dx$ from a factor you can't pick $x$ from there anymore and there are $n$ places you can choose to pick your $dx$ term from.
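
The counting can be done by brute force. In this sketch (the function name is hypothetical), each of the $n$ factors contributes either `"x"` or `"dx"`, and we count the expanded terms that contain exactly one `"dx"`.

```python
from itertools import product


def count_terms_with_one_dx(n):
    """Count the terms in the expansion of (x + dx)**n with exactly one dx."""
    return sum(
        1
        for choice in product(("x", "dx"), repeat=n)
        if choice.count("dx") == 1
    )


for n in range(1, 7):
    # Each such term equals x**(n-1) * dx, and there are n of them.
    print(n, count_terms_with_one_dx(n))
```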

2
On

Matt24, it's sad that the world of mathematics has come to a point where many mathematicians, asked for the intuition behind a certain principle in math, respond in the majority with a chalkboard or a textbook chapter full of symbols. ;) Symbols are fine, but they don't substitute for an understanding; you can only successfully symbolize something if you understand it first.

I don't think I can improve on the elegance and the simplicity of Thompson's intuitive explanations of Calculus. This is a very old book, but it's still the best textbook I've ever seen on Calculus. The beginning of Chapter IV precisely answers this question. (Book pages 18 and 19, but they're pages 32 and 33 of the PDF, because of the table of contents et al.) http://djm.cc/library/Calculus_Made_Easy_Thompson.pdf

(In the excerpt, I am using ^ for exponentiation and . for multiplication.)

Let us begin with the simple expression y = x^2. Now remember that the fundamental notion about the calculus is the idea of growing. Mathematicians call it varying. Now as y and x^2 are equal to one another, it is clear that if x grows, x^2 will also grow. And if x^2 grows, then y will also grow. What we have got to find out is the proportion between the growing of y and the growing of x. In other words our task is to find out the ratio between dy and dx, or, in brief, to find the value of dy/dx.

Let x, then, grow a little bit bigger and become x+dx; similarly, y will grow a bit bigger and will become y+dy. Then, clearly, it will still be true that the enlarged y will be equal to the square of the enlarged x. Writing this down, we have:

y+dy = (x+dx)^2.

Doing the squaring we get:

y+dy = x^2 + 2x.dx + (dx)^2

What does (dx)^2 mean? Remember that dx meant a bit—a little bit—of x. Then (dx)^2 will mean a little bit of a little bit of x; that is, as explained above (p. 4), it is a small quantity of the second order of smallness. It may therefore be discarded as quite inconsiderable in comparison with the other terms. Leaving it out, we then have:

y+dy = x^2 + 2x.dx

Now y=x^2; so let us subtract this from the equation and we have left

dy=2x.dx

Dividing across by dx, we find

dy/dx = 2x.

Now this is what we set out to find. The ratio of the growing of y to the growing of x is, in the case before us, found to be 2x.

0
On

Chain Rule: We want the derivative of $f(g(x))$ with respect to $x$. Derivatives, of course, are taken by minuscule changes in the target variable ($\frac{df(x)}{dx}$ is small change in rise over small change in run). We assume that $g(x)$ is continuous and differentiable, and thus a small enough change in $x$ will result in a small change in $g(x)$. We can then treat $g(x)$ as a variable, letting $u=g(x)$, then $\frac{df(u)}{dx} = \frac{df(u)}{dx} \cdot \frac{du}{du} = \frac{df(u)}{du} \cdot \frac{dg(x)}{dx}$.

Rule that $\frac{d}{dx}x^2=2x$: This one can be related to simple geometry by the fundamental theorem of calculus (FTC). The area of a right triangle is $\frac{1}{2}bh$. Graphically, this is interpreted as $\frac{1}{2}xf(x)$. Recall that $f(x)$ is a linear function, i.e. $f(x)=mx$ with slope $m=\frac{h}{b}$. In other words, the area of the triangle formed under linear function $f(x)=mx$ is given by $\frac{1}{2}xf(x) = \frac{1}{2}mx^2$. By the FTC, the derivative of $\frac{1}{2}mx^2$ is the function $mx$. The generalized rule, $\frac{d}{dx}x^n = nx^{n-1}$, is best understood algebraically (as shown in the other answers).

Inverse function rule: Nicely understood graphically, link with picture and link with picture. So basically $\frac{d}{dx}f^{-1}(x) = \frac{1}{f'(f^{-1}(x))}$. If you know that the derivative of $f(x)=e^x$ is $e^x$ then you can use the inverse rule to derive the derivative of $\ln(x)$:

$$\frac{d}{dx}\ln(x) = \frac{1}{e^{\ln(x)}} = \frac{1}{x}$$

It's not graphical, but you can use logarithms for a short pseudoproof of the power rule:

$$y=x^n \Rightarrow \ln(y)=n \ln(x) \Rightarrow \frac{y'}{y}=\frac{n}{x} \Rightarrow y'=n\frac{x^n}{x} \Rightarrow y'=nx^{n-1}$$

Basic rules: Some of the most basic rules (addition rule, product rule, quotient rule) are difficult to understand graphically, but easily follow algebraically from the definition of derivative and integral.

Also, with the product rule you can derive the (integer) power rule: Assume we know the derivative of $x^n$ is $nx^{n-1}$ and that we know the derivative of $x$ is $1$. Well $x^{n+1}=x\cdot x^n$, so $$\frac{d}{dx}x^{n+1} = \frac{d}{dx}(x\cdot x^n) = x^n\frac{d}{dx}x+x\frac{d}{dx}x^n = x^n+x\cdot nx^{n-1} = (1+n)x^n$$
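
The induction step can be run as a loop. A small sketch (the function name is my own): start from $x$ with derivative $1$ and apply the product rule to $x\cdot x^k$ repeatedly, never using the power rule itself.

```python
def power_rule_by_induction(n, x):
    """Derivative of x**n at x (n >= 1), built only from the product rule."""
    value, slope = x, 1.0   # x^1 and its derivative
    for _ in range(n - 1):
        # Product rule on x * x^k: (x * x^k)' = 1 * x^k + x * (x^k)'
        value, slope = x * value, value + x * slope
    return slope


x = 2.0
for n in range(1, 6):
    print(n, power_rule_by_induction(n, x), n * x ** (n - 1))
```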

But once again, the algebraic explanation that was given in other answers $$\lim_{\delta \rightarrow 0}\frac{(x+\delta)^n - x^n}{\delta} = \lim_{\delta \rightarrow 0}\frac{(x^n+{n\choose 1}x^{n-1}\delta+{n\choose 2}x^{n-2}\delta ^2+\ldots)-x^n}{\delta} = \lim_{\delta \rightarrow 0}\frac{{n\choose 1}x^{n-1}\delta+{n\choose 2}x^{n-2}\delta ^2+\ldots}{\delta} = \lim_{\delta \rightarrow 0}\left({n\choose 1}x^{n-1}+{n\choose 2}x^{n-2}\delta+\ldots\right) = {n\choose 1}x^{n-1} = nx^{n-1}$$ is a pretty good explanation.

0
On

All of the derivative rules come from looking at the corresponding linear approximations.

At a point $p$, $f(x) \approx f(p)+f'(p)(x-p)$, and $g(x) \approx g(p)+g'(p)(x-p)$.

So it makes sense that the sum of the functions is well approximated by the sum of the tangent lines, and thus that the slopes sum.

Constant multiples work via the same reasoning.

The chain rule works like this:

$\begin{align*} f(g(x)) &\approx f(g(p)+g'(p)(x-p)) \text{ using linear approx of $g$ at $p$}\\ &\approx f(g(p))+f'(g(p))g'(p)(x-p) \text{ using linear approx of $f$ at $g(p)$} \end{align*}$

so the slope of $f \circ g$ at $p$ should be $f'(g(p))g'(p)$.

The only rule which does not work like this ("tangent line to sum is sum of tangent lines", "tangent line to composition is composition of tangent lines", etc.) is the product rule.

The problem is, when we multiply tangent lines we get a parabola, not a line. But it is okay, because we take the tangent line to that parabola in a fairly intuitive way:

$\begin{align*} f(x)g(x) &\approx (f(p)+f'(p)(x-p))(g(p) + g'(p)(x-p))\\ &=f(p)g(p)+(f'(p)g(p)+f(p)g'(p))(x-p)+f'(p)g'(p)(x-p)^2 \end{align*}$

But it is visually obvious that $(x-p)^2$ has zero slope at $p$, so the tangent line must just be $y=f(p)g(p)+(f'(p)g(p)+f(p)g'(p))(x-p)$. This yields the product rule.
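
Here is a rough numerical check of that claim, with $f=\sin$ and $g=\exp$ at $p=0.5$ chosen arbitrarily for illustration. The helper `product_of_tangent_lines` (a made-up name) returns the three coefficients of the parabola, and the middle one is compared with a two-point slope estimate of $fg$.

```python
import math


def product_of_tangent_lines(f_p, df_p, g_p, dg_p):
    """Coefficients (c0, c1, c2) of c0 + c1*(x-p) + c2*(x-p)**2."""
    return (f_p * g_p, df_p * g_p + f_p * dg_p, df_p * dg_p)


def fg(x):
    """The actual product function sin(x) * exp(x)."""
    return math.sin(x) * math.exp(x)


p = 0.5
c0, c1, c2 = product_of_tangent_lines(
    math.sin(p), math.cos(p), math.exp(p), math.exp(p)
)

dx = 1e-6
numeric_slope = (fg(p + dx) - fg(p - dx)) / (2 * dx)

# c1 is the slope of the parabola at p; c2 does not affect the slope there.
print(c1, numeric_slope)
```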

Interestingly, this approach sheds new (?) light on the derivative of $f(x) =x^2$. Namely

\begin{align*} x^2 &= \left((x-p)+p\right)^2\\ &=p^2+2p(x-p)+(x-p)^2 \end{align*}

Since $(x-p)^2$ surely has zero slope at $p$, we can see that the slope of $f(x) = x^2$ at $p$ should be $2p$.

0
On

The rules of differentiation can all be derived from the definition $$ \frac{{\rm d} }{{\rm d}x} y(x) =\lim_{h\rightarrow0} \frac{1}{h}( y(x+h)-y(x)) $$

Instead of trying to interpret the algebraic results of the rules (i.e. why is there a $2$ in front of the derivative of $x^2$), try to develop a geometric sense for the derivatives.

From the Derivative ≡ Slope interpretation you notice things like: a scalar multiple of a function has a slope with the same scalar factor, and the slope of an even function is an odd function (and vice versa). Try not to focus on why the rules are what they are, but on what they mean.

The rules are just a tool (a shortcut, if you will) to spare you from doing the above limit every time. Some of the rules themselves can be derived by induction from more basic rules (like for $x^n$), whilst trying to derive them from the limit is just a lot more complex.

0
On

If you look for intuition, I cannot recommend the series "Essence of Calculus" by 3Blue1Brown enough!