Is there a "No Free Lunch Theorem" for Polynomial Approximation?

In search and optimization, a popular set of theorems exists called the "No Free Lunch Theorems". The authors describe them as saying that "for any algorithm, any elevated performance over one class of problems is offset by performance over another class" (https://ieeexplore.ieee.org/document/585893). These theorems have important interpretations in many applied fields such as Machine Learning. In layman's terms, the "No Free Lunch Theorem" asserts that there is no "universally best" class of algorithm for all "classes of optimization problems".

I was wondering if a similar type of theorem exists for Polynomial Approximation. For example, we know that there are many methods in mathematics that can be used for function approximation - for instance, Taylor Polynomials, Lagrange Polynomials, Padé Polynomials, etc.

As in the case of Machine Learning algorithms, can any claims similar to those asserted by the "No Free Lunch Theorems" be made regarding different methods of Polynomial Approximation? For example - is it possible that Taylor Polynomials might "better" approximate some functions compared to Padé Polynomials; but by virtue of this fact, there also might exist some other functions that Padé Polynomials approximate better than Taylor Polynomials? Or have we been able to mathematically determine that some approximation methods are "universally better" than other approximation methods for broad classes of functions?

Thank you!

Note: I suppose that in this context, "better" can be quantified through properties such as "strength of convergence", "size of the radius of convergence" or properties of the "bounded error" (e.g. magnitude) - but I am not sure about this.

There are plenty of functions that polynomial interpolation handles very badly. The simplest one is probably

$$f(x) = |x|$$

If you actually try to fit Lagrange polynomials through $n = 3,5,7,9,11,13$ points, they end up looking like this:

[Figure: polynomial interpolation of $|x|$ for increasing $n$]
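
A quick numerical check of this (a minimal sketch, assuming equally spaced interpolation nodes; `interp_error` and all parameter choices are mine):

```python
import numpy as np

def interp_error(n, f=np.abs, a=-1.0, b=1.0):
    """Max error of the degree-(n-1) polynomial through n equally
    spaced samples of f, measured on a fine grid over [a, b]."""
    nodes = np.linspace(a, b, n)
    coeffs = np.polyfit(nodes, f(nodes), n - 1)  # interpolating polynomial
    grid = np.linspace(a, b, 2001)
    return float(np.max(np.abs(np.polyval(coeffs, grid) - f(grid))))

for n in (3, 5, 7, 9, 11, 13):
    print(n, interp_error(n))  # the error does not shrink toward 0
```

For $n=3$ the interpolant is $x^2$ with maximum error $0.25$; raising $n$ does not drive the error toward zero, which is exactly what the figure shows.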

Polynomials aren't the end-all-be-all of function approximation. They work quite well on analytic functions (those locally given by convergent power series), but "most" functions aren't even differentiable everywhere — $f(x) = |x|$, for instance, has no derivative at $0$.

Of course, in reality you would likely not try to fit a polynomial onto $|x|$ itself, but real-world formulas do contain absolute values, or other points where the function isn't differentiable.

A good example of this is superconductivity. The electrical resistance of a superconductor drops abruptly from a measurable, positive value to $0$ as soon as the temperature falls below a critical threshold:

[Figure: resistance vs. temperature for a superconductor]

If you didn't know this and tried to fit a polynomial onto some points measured from the red curve, you'd be disappointed: no polynomial reproduces the jump.

So all in all, the "No Free Lunch" theorem would be something like:

If $f: \mathbb{R} \to \mathbb{R}$ has a corner at some $a \in \mathbb{R}$, i.e. $\lim_{x\to a^+} f'(x) \ne \lim_{x\to a^-} f'(x)$, then no polynomial $p$ can agree with $f$ on any neighborhood of $a$ (polynomials are smooth everywhere), and a Taylor expansion at $a$ does not even exist. Polynomials can still come uniformly close to such an $f$ on a compact interval (Weierstrass approximation theorem), but only slowly: by a classical result of Bernstein, the best degree-$n$ polynomial approximation of $|x|$ on $[-1,1]$ has an error on the order of $1/n$, whereas for analytic functions the error decays geometrically in $n$.

This is easy to see from the absolute value example. A function $f$ with the property

$$\lim_{x\to a^+} f'(x) \ne \lim_{x\to a^-} f'(x)$$

is "locally an absolute value function" near $a$. (For the actual absolute value function at $a=0$, the right-hand limit is $1$ while the left-hand limit is $-1$.) So even in a small neighborhood of $a$, no single polynomial can coincide with $f$, and polynomial approximations converge much more slowly there than they would for a smooth function.

we know that there are many methods in mathematics that can be used for function approximation - for instance, Taylor Polynomials, Lagrange Polynomials, Padé Polynomials, etc.

Just to sort that out:

  • Lagrange polynomials are used for polynomial-type interpolation of functions. This means you have a set of points $P_i = (x_i, y_i)$ that you want to meet exactly, i.e. the interpolation $p(x)$ shall satisfy $p(x_i) = y_i$ for all $i$.

  • Padé approximants of order $[m/n]$ are rational approximations of the form $p_m(x)/q_n(x)$ where $p_m$ and $q_n$ are polynomials of degree $m$ and $n$, respectively. In the special case $n=0$ we get the Taylor polynomial, but in general you want to benefit from the extra freedom of the rational form.
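
To illustrate the first bullet, a small sketch using SciPy's `lagrange` helper (the sample points are my own invented data): the interpolating polynomial reproduces every given point exactly.

```python
import numpy as np
from scipy.interpolate import lagrange

# Four sample points (x_i, y_i) that the interpolant must hit exactly.
xi = np.array([0.0, 1.0, 2.0, 3.0])
yi = np.array([1.0, 2.0, 0.0, 5.0])

p = lagrange(xi, yi)  # the unique degree-3 polynomial through the points

print(p(xi))   # reproduces yi exactly (up to rounding)
print(p(1.5))  # between the points, the value is an interpolation guess
```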

The common trait of Taylor and Padé is that they are local methods: all you need to know is the $k$-th derivatives of the function at some point $x_0$ and the order $n$ resp. $[n/m]$ to which you want to approximate some (analytic) function. The cost of evaluating a polynomial of degree $n$ is $n$ additions and $n$ multiplications (using Horner's scheme). Thus we get the following costs in terms of the basic operations Add, Mul, Div and MAcc$^3$:

          +      ×       /    MAcc
Taylor:   n      n       0      n
Padé:     n+m    n+m-1   1    n+m-1

The "-1" in Padé assumes that numerator or denominator polynomial has a highest coefficient of $1$. These costs depend on the arithmetic you are using, like floating-point vs. fixed-point, hardware vs. emulated in software, etc.

Then you'll have to know that Padé usually excels when $n\approx 2m$, i.e. when the degree of the numerator is about twice that of the denominator. A Padé approximant of order $[n/m]$ usually performs better (in terms of approximation error) than a Taylor polynomial of degree $n+m$, and its error decreases faster as $n$ and $m$ increase. The advantage of Padé over Taylor is usually more pronounced for functions with a finite radius of convergence (RoC, like $\arctan$) than for functions with infinite RoC (like $\exp$ or $\sin$).
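
The $\arctan$ case can be checked numerically. A sketch using `scipy.interpolate.pade`, which builds a Padé approximant from the Taylor coefficients (the evaluation point $x=2$, outside the RoC of $1$, is my choice):

```python
import math
import numpy as np
from scipy.interpolate import pade

# Taylor coefficients of arctan at 0: x - x^3/3 + x^5/5 (ascending order)
an = [0.0, 1.0, 0.0, -1.0/3.0, 0.0, 1.0/5.0]
p, q = pade(an, 2)  # [3/2] Padé approximant from the same data

x = 2.0  # outside arctan's radius of convergence
err_taylor = abs(np.polyval(an[::-1], x) - math.atan(x))
err_pade = abs(p(x) / q(x) - math.atan(x))
print(err_taylor, err_pade)  # Padé stays usable where Taylor diverges
```

At $x=2$ the degree-5 Taylor polynomial is off by more than $4$, while the $[3/2]$ Padé approximant is off by roughly $0.1$.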

The main competitors of Taylor/Padé are MiniMax polynomials resp. MiniMax rational approximations. "MiniMax" stands for "minimizing the $\max$-norm (over some interval)". By definition, these methods are non-local: you have to know the function on the whole interval where it is to be approximated, as opposed to Taylor/Padé, where knowing its properties at the expansion point $x_0$ is enough. The MiniMax property means that

$$\|f-a\|_\max \stackrel!= \min$$

where $f$ is the function to be approximated, $a$ is the proximum, and the norm is taken over some interval $I$. This means that the proximum is the best$^1$ solution and thus beats Taylor resp. Padé of the same order. However, you have to know the interval $I$ in advance and you have to pre-compute the proximum, which is much more complicated and trickier than for Taylor/Padé. In particular, MiniMax results depend on the interval. The costs in terms of basic operations are the same as for Taylor (MiniMax polynomial) resp. Padé (MiniMax rational approximation) of the same degree.
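
To make this concrete: true MiniMax approximations require the Remez exchange algorithm, but Chebyshev interpolation is a cheap near-minimax stand-in, and it already beats the Taylor polynomial of the same degree over an interval. A sketch with NumPy (the function $\exp$, degree, and interval are my choices):

```python
import math
import numpy as np
from numpy.polynomial import chebyshev as C

deg = 4
grid = np.linspace(-1.0, 1.0, 2001)

# Local method: Taylor polynomial of exp at 0, degree 4
taylor = sum(grid**k / math.factorial(k) for k in range(deg + 1))

# Non-local stand-in: Chebyshev interpolant of exp on [-1, 1]
cheb = C.Chebyshev.interpolate(np.exp, deg, domain=[-1, 1])

err_taylor = float(np.max(np.abs(taylor - np.exp(grid))))
err_cheb = float(np.max(np.abs(cheb(grid) - np.exp(grid))))
print(err_taylor, err_cheb)  # the interval-aware method wins
```

The Taylor error concentrates at the interval ends (about $0.01$ at $x=1$), while the Chebyshev interpolant spreads the error over the interval and is several times smaller in the $\max$-norm.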

What also has to be considered is whether you want to minimize the absolute error (usual with fixed-point arithmetic) or the relative error (usual with floating-point arithmetic). With Taylor/Padé you have no choice, but with MiniMax you get different best approximations depending on whether you minimize for absolute error or for relative error. This is particularly important in the presence of zeros, where the relative error is not well-defined$^2$.

There are many other means of approximation, like CORDIC, lookup tables (LUTs), or a LUT plus linear or higher-order interpolation such as cubic splines, just to name a few.

What you also have to consider is whether you know the function explicitly, like $\sin$, or only by its values, e.g. by measurements from some temperature sensor that you want to model in software in order to map voltage measurements to temperature readings.

If the function you want to approximate is not smooth, then Taylor cannot work: even if the series converges, it represents the analytic continuation of one branch, not the target function. An example is approximating $|x|$. A Taylor expansion at $0$ doesn't work because the $k$-th derivatives are not defined there, and expanding around, say, $0.01$ just yields the polynomial $x$, which is useless for $x<0$. Padé has the same problems, with the exception that it can model poles$^4$. MiniMax doesn't work well in the presence of singularities like those of piecewise-defined functions: the approximation error decreases very slowly as the order increases; better to use piecewise-defined approximations from the start.

Now, all this is no "No Free Lunch Theorem" of any rigor, but this short answer should make it clear that there is no one-size-fits-all solution and that there are many factors that influence your choice, from purely mathematical properties of the methods to down-to-earth capabilities of your computer or microprocessor. To recap:

  • Is the function to be approximated over an interval that's known in advance?

  • Is the approximation error that has to be met known in advance?

  • Do you want to minimize the absolute error or the relative error?

  • What are the capabilities of the underlying hardware? Speed of execution, available memory for storage, e.g. for LUTs or to implement algorithms.

  • Are the algorithms going to use floating-point arithmetic or fixed-point arithmetic? In the latter case, you'll have to care about how you perform specific calculations so that no intermediate result will overflow the range that fixed-point can represent. This range is usually considerably smaller than the range floating-point can cover.

  • Are the functions known explicitly?

  • Are the functions well-behaved (smooth)?

  • Some algorithms like CORDIC are just available for a specific, small set of functions. Methods other than CORDIC apply to a vastly larger space of functions.


$^1$The proximum is unique in most practical applications.

$^2$What you usually do is factor out the zeros. For example, to approximate $\sin$ around $0$, you'd better approximate $g(x) = \sin(\sqrt x)/\sqrt{x}$ and then use $\sin x = x\cdot g(x^2)$. Notice that $g(0)\neq0$, and thus $\sin x$ scales with $x$ when $|x|\approx 0$, which basically turns the absolute error of $g$ into a relative error of $\sin x$ around $0$.
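
This factoring trick can be sketched directly: approximate $g(y)=\sin(\sqrt y)/\sqrt y$ by its low-order Taylor polynomial (coefficients $1, -1/3!, 1/5!, -1/7!$) and recover $\sin$ from it (the degree and test interval are my choices):

```python
import math

def g(y):
    # Taylor polynomial of sin(sqrt(y))/sqrt(y): 1 - y/3! + y^2/5! - y^3/7!
    return 1.0 - y/6.0 + y*y/120.0 - y**3/5040.0

def sin_approx(x):
    # sin(x) = x * g(x^2): the factored-out zero makes the error scale with x
    return x * g(x * x)

worst = max(abs(sin_approx(k/100.0) - math.sin(k/100.0))
            for k in range(-100, 101))
print(worst)  # on [-1, 1]: bounded by the first omitted term, 1/9! ≈ 2.8e-6
```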

$^3$MAcc stands for "multiply-accumulate", a fused multiply-add instruction implemented on most modern hardware. Thus one MAcc can replace (in terms of arithmetic) one Mul and one Add, i.e. $\operatorname{macc}(x,y,z) = x+yz$. It's usually faster, uses less program memory and comes with smaller rounding errors than performing one multiply and one add separately.

$^4$In the presence of poles: for practical purposes, you would remove the pole by "multiplying it out", approximate the smoothed function, and then divide by the appropriate factor to reinstate the pole.