I am reading a paper on the Information Bottleneck Method. In this paper the authors give a short review of rate distortion theory. In rate distortion theory we are interested in the rate distortion function $R(D)$, which is defined as $$R(D) = \min_{p(\tilde x | x)} I(X, \tilde{X}) \qquad\text{such that}\quad \mathbb{E}_{p(x,\tilde x)}[d(x, \tilde x)] \leq D$$ where:
- $X$ is our source random variable. It is discrete and finite.
- $\tilde X$ is a quantization of $X$ (this quantization is stochastic, i.e. we have a stochastic mapping $p(\tilde x | x)$)
- $d(x, \tilde x)$ is our distortion function, which measures the quality of the quantization (in practice, it can be something like squared error or Hamming distance)
- $\beta$ is the Lagrange multiplier
The authors say that we find it by minimizing the Lagrangian:
$$\mathcal{F}(p(\tilde x | x)) = I(X, \tilde{X}) + \beta \mathbb{E}_{p(x,\tilde x)}[d(x, \tilde x)]$$
There are two points that I do not understand:
Why does the second term in the Lagrangian not contain $D$? Wikipedia says that an inequality constraint "$g(x) \leq C$" should give a second term of the form "$\beta(g(x) - C)$", not "$\beta g(x)$". Do we somehow imply that $C = 0$ here?
Why do the authors say that we find the whole function $R(D)$? It looks like we are finding just a single value of this function, evaluated at the point $D$, not its whole form.
This is quite common in this kind of maximization/minimization problem.
In general, suppose you want to find an extremum of $f(x)$ subject to $g(x)\le D$. The method of Lagrange multipliers does not directly handle such an inequality constraint. But if you have reason to assume (and we often do) that the constraint is "binding", i.e. that a bigger $D$ would give a better extremum of $f(x)$, this implies that the extremum of $f(x)$ subject to $g(x)\le D$ will occur on the boundary of the region, where $g(x)=D$, or equivalently $g(x)-D=0$. Then we are justified in replacing the inequality by an equality.
Then the method of Lagrange multipliers tells us to find the extremum of the Lagrangian
$$J_1(x) = f(x) + \beta (g(x)-D) \tag1$$ together with $g(x)-D=0$. But because $D$ is a constant, this is equivalent to finding the extremum of the alternative Lagrangian
$$ J_2(x) =f(x) + \beta' g(x) \tag2$$
But finding the extremum of $(2)$ is not the same as assuming $D=0$, because the additional equation $g(x)-D=0$ remains in force. Different values of $D$ will give different values of the multiplier and hence different extrema.
Namely, if the extremum corresponds to a critical point, both $(1)$ and $(2)$ will lead to the same system of equations:
$$\begin{cases} f'(x) + \beta g'(x) &= 0\\ g(x) - D &= 0 \end{cases} \tag3$$
The solution, of course, depends on $D$.
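To make this concrete, here is a minimal sketch with a toy problem of my own choosing (not from the paper): minimize $f(x) = (x-3)^2$ subject to $g(x) = x^2 \le D$ for $0 < D < 9$, where the constraint is binding. Solving system $(3)$ by hand gives $x^\* = \sqrt D$ and $\beta = 3/\sqrt D - 1$; the code checks that this pair satisfies the stationarity condition (which is the same for both $J_1$ and $J_2$, since they differ only by the constant $\beta D$), and shows that sweeping $D$ traces out a whole curve of optimal values, one point per $D$:

```python
import math

def solve_toy(D):
    """Minimize f(x) = (x-3)^2 subject to g(x) = x^2 <= D, for 0 < D < 9.

    Stationarity of the Lagrangian (same for J1 and J2):
        f'(x) + beta*g'(x) = 0  ->  2(x-3) + 2*beta*x = 0  ->  x = 3/(1+beta)
    The constraint is binding at the optimum:  x = sqrt(D).
    Together these give beta = 3/sqrt(D) - 1.
    """
    x = math.sqrt(D)          # optimum sits on the boundary g(x) = D
    beta = 3.0 / x - 1.0      # multiplier value that enforces that boundary
    # J1 and J2 differ only by the constant beta*D, so they share critical points:
    assert abs(2 * (x - 3) + beta * 2 * x) < 1e-12
    return x, beta, (x - 3) ** 2

# Each D (equivalently, each beta) gives ONE point; sweeping D traces the curve.
for D in (1.0, 4.0, 6.25):
    x_star, beta, f_star = solve_toy(D)
    print(f"D={D}: x*={x_star:.3f}, beta={beta:.3f}, f(x*)={f_star:.3f}")
```

This also illustrates the answer to your second question: minimizing the Lagrangian for one fixed multiplier yields a single point of the curve, and varying the multiplier (or $D$) sweeps out the whole function, just as varying $\beta$ sweeps out $R(D)$.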