I've noticed that, if we input the linear approximation of $g$ near $a$ into $f$, we get a decent approximation (although non-linear) of the linear approximation of $f \circ g$ near $a$. In order to actually have an identity between these two approximations, we need to imagine that, on a small enough neighborhood of $a$, $f$ also begins behaving like its linear approximation, so that $f \circ$ (the linear approximation of $g$ near $a$) actually becomes identical to the linear approximation of $f \circ g$ near $a$.
My question is: why exactly does $f$ get linearized near $g(a)$ by this procedure? How does choosing a small enough neighborhood of $a$ make it so that $f$ gets linearized there? (I'm thinking of composition of functions and linear approximations in these terms to get a firmer grasp on the proof of the chain rule.) One could easily imagine that inputting the linear approximation of $g$ near $a$ into $f$ simply distorts $g$ near $a$ in a non-linear fashion, all the way down to the smallest interval of the reals you might want to choose, so I'm having a hard time thinking through this problem.
Thank you!
Assuming $g$ is differentiable near $a$ and $f$ is differentiable near $g(a)$, we can expand each in a Taylor series. The result is just what you would expect from the chain rule. For $x$ small: $$g(a+x) \approx g(a)+xg'(a)\\ f(g(a+x)) \approx f(g(a))+x\left.\frac d{dx}f(g(a+x))\right|_{x=0}\\f(g(a+x)) \approx f(g(a))+x\,f'(g(a))\,g'(a)$$
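You can also check this numerically. Here's a small sketch (my own illustration, with $f = \sin$, $g = \exp$, $a = 1$ chosen arbitrarily) comparing $f(g(a+h))$ against the chain-rule linearization $f(g(a)) + h\,f'(g(a))\,g'(a)$; the gap should shrink like $h^2$, i.e. faster than $h$ itself, which is exactly what "the first-order terms agree" means:

```python
import math

# f and g with their derivatives: f = sin, g = exp, expanded around a = 1
f, fp = math.sin, math.cos
g, gp = math.exp, math.exp
a = 1.0

for h in [0.1, 0.01, 0.001]:
    exact = f(g(a + h))                          # true value of f(g(a+h))
    linear = f(g(a)) + h * fp(g(a)) * gp(a)      # chain-rule linear approximation
    # shrinking h by 10 should shrink the error by roughly 100 (quadratically)
    print(f"h = {h:>6}: error = {abs(exact - linear):.2e}")
```

Watching the error drop by about two orders of magnitude each time $h$ drops by one is the concrete sense in which $f$ "becomes linear" on a small enough neighborhood of $g(a)$.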