I am currently studying "Understanding Machine Learning: From Theory to Algorithms" by Shai Shalev-Shwartz and Shai Ben-David, and I have trouble understanding the presented proof of the No-Free-Lunch theorem.
Theorem 5.1 (No-Free-Lunch): Let $A$ be any learning algorithm for the task of binary classification with respect to the $0 - 1$ loss over a domain $X$. Let $m$ be any number smaller than $|X|/2$, representing a training set size. Then, there exists a distribution $D$ over $X \times \{0, 1\}$ such that:
- There exists a function $f : X \rightarrow \{0, 1\}$ with $L_D(f) = 0$.
- With probability of at least $1/7$ over the choice of $S \sim D^m$, we have that $L_D(A(S)) \geq 1/8$.
Proof: Let $C$ be a subset of $X$ of size $2m$.
Note that there are $T = 2^{2m}$ possible functions from $C$ to $\{0, 1\}$. Denote these functions by $f_1, \ldots, f_T$. For each such function, let $D_i$ be a distribution over $C \times \{0, 1\}$ defined by:
$$D_i(\{(x,y)\})=\begin{cases} 1/|C| & \text{if } y = f_i(x), \\ 0 & \text{otherwise} \end{cases} $$
We will show that for every algorithm, $A$, that receives a training set of $m$ examples from $C \times \{0, 1\}$ and returns a function $A(S) : C \rightarrow \{0, 1\}$, it holds that: \begin{equation} \max_{i\in [T]}\mathbb{E}_{S \sim D_i^m}[L_{D_i}(A(S))] \geq \frac{1}{4}. \tag{1} \end{equation}
Clearly, this means that for every algorithm, $A'$, that receives a training set of $m$ examples from $X \times \{0, 1\}$ there exists a function $f : X \rightarrow \{0, 1\}$ and a distribution $D$ over $X \times \{0, 1\}$, such that $L_D(f) = 0$ and: \begin{equation} \mathbb{E}_{S \sim D^m}[L_D(A'(S))] \geq \frac{1}{4}. \tag{2} \end{equation}
I don't fully understand why (1) implies (2). For (1) we defined our probability distributions $D_i$ over $C\times \{0,1\}$, but in (2) we are talking about distributions over $X\times \{0,1\}$. Do we just extend each $D_i$ to $X\times \{0,1\}$ by assigning probability zero to every pair $(x,y)$ that is not in $C\times \{0,1\}$?
Or, for a more general intuition: suppose we have a probability distribution $D$ over some set $C$ that is a subset of $X$, and we can make claims about $D$. Is it possible to extend $D$ to the domain $X$ (call the extension $P$) and derive claims about $P$ from the claims about $D$?
$\def\calS{\mathcal{S}}$ $\def\calW{\mathcal{W}}$ $\def\qty#1{\left( #1 \right)}$ $\def\calX{\mathcal{X}}$ $\def\calY{\mathcal{Y}}$ $\def\abs#1{\lvert #1 \rvert}$ $\def\pdv#1#2{\frac{\partial #1}{\partial #2}}$ $\def\expect{\mathbb{E}}$ $\def\indic{\mathbb{1}}$ In their book "Understanding Machine Learning" the authors look at algorithms for binary classification; all samples are drawn from a domain $\calX$ which has cardinality $\abs{\calX}$. They draw the training and testing sets from $C\subset\calX$ with cardinality $2m$, where $m\leq \frac{\abs{\calX}}{2}$.
If we think of a function $f$ as a list of pairs $(x,y)$ where $x\in C$ and $y\in\{0,1\}$ then for each $x$ there are two possible pairs $(x,0)$ and $(x,1)$. Since $C$ has $2m$ elements there are $T=2^{2m}$ such lists of pairs; hence there are $T$ functions that can be defined from $C$ to $\{0,1\}$. These are denoted as $f_1,\ldots,f_i, \ldots,f_T$.
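The function count $T=2^{2m}$, and the fact that any fixed element receives the label $1$ under exactly half of these functions, can be checked by brute force for a toy case. The encoding below (each function as a tuple of labels indexed by the elements of $C$) is my own illustrative choice, not the book's notation:

```python
from itertools import product

m = 2
C = list(range(2 * m))  # a toy domain subset of size 2m

# Each function f: C -> {0,1} is encoded as a tuple of labels, one per element of C.
functions = list(product([0, 1], repeat=len(C)))

assert len(functions) == 2 ** (2 * m)  # T = 2^{2m} = 16 for m = 2

# For any fixed x in C, exactly half of the T functions label it 1.
x = 0
assert sum(f[x] for f in functions) == len(functions) // 2
```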
Postulate 1 states that "there exists a distribution $D$ over $\calX \times \{0,1\}$ such that there exists a function $f:\calX \rightarrow \{0,1\}$ with $L_D(f)=0$." $L_D(h)$ is defined as, $$ L_D(h)=\Pr_{x\sim D}[h(x)\neq f(x)] $$ where $f$ is the "correct" labelling function and $x\sim D$ denotes that the data is drawn from a distribution $D$. All distributions $D$ are defined over an event space $\calX \times \{0,1\}$ as follows, $$ \begin{array}{l | c c} & 0 & 1 \\ \hline x_1 & \Pr[(x_1,0)] & \Pr[(x_1,1)] \\ x_2 & \Pr[(x_2,0)] & \Pr[(x_2,1)] \\ \vdots & \vdots & \vdots \\ x_{\abs{\calX}-1} & \Pr[(x_{\abs{\calX}-1},0)] & \Pr[(x_{\abs{\calX}-1},1)] \\ x_{\abs{\calX}} & \Pr[(x_{\abs{\calX}},0)] & \Pr[(x_{\abs{\calX}},1)] \\ \end{array} $$ where $\sum_{x \in \calX} \left(\Pr[(x,0)]+\Pr[(x,1)]\right) =1$. If we define a distribution $D_i$ over $C \times \{0,1\}$ s.t. $\Pr[(x,f_i(x))]=1/\abs{C}$, and zero otherwise, then this is a valid distribution since, $$ \sum_{x \in C} \left(\Pr[(x,0)]+\Pr[(x,1)]\right) = \sum_{x \in C} \Pr[(x,f_i(x))] = \sum_{x \in C} \frac{1}{\abs{C}} = 1 $$ Here $f_i(x)$ is one of the $T$ functions that can be defined from $C$ to $\{0,1\}$. Since the data generated from $D_i$ have $f_i(x)$ as the correct label function, it follows from the definition of $L_D$ that $L_{D_i}(f_i)=0$.
How many training sets $S_j=(x_1,\ldots,x_m)$ can we generate? Since each of the $m$ elements is drawn with replacement, we have $2m$ choices from $C$ for each of them, and therefore there are $k=(2m)^m$ possible training sets. When a training set is augmented with the corresponding labels $f_i(x_j)$, the authors denote it $S^i_j$.
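The count $k=(2m)^m$ can likewise be verified by enumeration; as before, `m` and `C` are toy values of my own choosing:

```python
from itertools import product

m = 2
C = list(range(2 * m))  # toy domain subset of size 2m

# Each unlabelled training set is an ordered draw of m points with replacement.
training_sets = list(product(C, repeat=m))

assert len(training_sets) == (2 * m) ** m  # k = (2m)^m = 16 for m = 2
```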
What is the probability of picking one specific training set? Since this is random sampling, and under $D_i$ all $x \in C$ have equal probability of being chosen, the probability is $\frac{1}{k}$; we can use each training set to calibrate an algorithm $A$. We can calculate the expected generalisation error, \begin{equation} \expect[L_{D_i}(A(S))] = \frac{1}{k} \sum_{j=1}^k L_{D_i}(A(S^i_j)) \tag{EL} \label{EL} \end{equation} It is important to understand the notation here: $A(S^i_j)$ means "the model $A$ calibrated using training set $S^i_j$." When we read $L_{D_i}(A(S^i_j))$ it does not mean "the loss of model $A$ over the training set $S^i_j$"; we would expect that to be close to zero after the algorithm was calibrated over the training set. The authors mean, $$ L_{D_i}(A(S^i_j)) = \frac{1}{2m} \sum_{x \in C} \indic_{A(S^i_j)(x) \neq f_i(x) } =\Pr_{x \sim D_i}[A(S^i_j)(x) \neq f_i(x)] $$ which is the probability that the model output, after calibration using the training set $S^i_j$, differs from the correct label $f_i$ over the whole range of elements of $C$ (remember that since the training set was generated using $D_i$, $f_i$ gives the correct labels). $\indic_E$ is equal to one if $E$ is true and zero otherwise.
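Equation (EL) can be evaluated directly for a tiny case. The learner `A` below (memorise the training labels, predict $0$ elsewhere) is a hypothetical stand-in for a concrete algorithm, not anything from the book; the loss follows the definition above:

```python
from itertools import product

m = 1
C = list(range(2 * m))                              # |C| = 2
functions = list(product([0, 1], repeat=len(C)))    # T = 2^{2m} = 4 labellings
draws = list(product(C, repeat=m))                  # k = (2m)^m = 2 training draws

def A(S):
    """Hypothetical learner: repeat labels seen in training, predict 0 elsewhere."""
    seen = dict(S)
    return lambda x: seen.get(x, 0)

def loss(h, f):
    # L_{D_i}(h) = (1/|C|) * #{x in C : h(x) != f_i(x)}
    return sum(h(x) != f[x] for x in C) / len(C)

f = functions[3]  # the labelling f_i that maps every x in C to 1
expected_loss = sum(loss(A([(x, f[x]) for x in xs]), f) for xs in draws) / len(draws)
# expected_loss = 0.5: the single unseen point is always mislabelled
```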
There are $T$ possible functions and we can pick a function $f_i$ that maximizes (\ref{EL}), the expected generalization error. The authors denote this by writing, $$ \max_{i \in [T]} \frac{1}{k} \sum_{j=1}^k L_{D_i}(A(S^i_j)) $$ where $[T]=\{1,\ldots,T \}$. The average value of, $$ \frac{1}{k} \sum_{j=1}^k L_{D_i}(A(S^i_j)) $$ (where the average is taken over all possible functions) is, $$ \frac{1}{T} \sum_{i=1}^T \frac{1}{k} \sum_{j=1}^k L_{D_i}(A(S^i_j)) $$ Since, by definition, the maximum of a series of values is greater than the average of a series of values we can write, $$ \max_{i \in [T]} \frac{1}{k} \sum_{j=1}^k L_{D_i}(A(S^i_j)) \geq \frac{1}{T} \sum_{i=1}^T \frac{1}{k} \sum_{j=1}^k L_{D_i}(A(S^i_j)) $$
There is one specific index $j$ for which the average over $i$, namely $\frac{1}{T}\sum_{i=1}^T L_{D_i}(A(S^i_j))$, is smallest. Since an average over $j$ can be no smaller than the minimum over $j$, and since we may swap the order of the two sums, we can write, $$ \frac{1}{T} \sum_{i=1}^T \frac{1}{k} \sum_{j=1}^k L_{D_i}(A(S^i_j)) = \frac{1}{k} \sum_{j=1}^k \frac{1}{T} \sum_{i=1}^T L_{D_i}(A(S^i_j)) \geq \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^T L_{D_i}(A(S^i_j)) $$ These inequalities follow purely from the facts that a maximum is at least an average and an average is at least a minimum; they make no reference to the preceding analysis.
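The max $\geq$ average $\geq$ min chain can be sanity-checked on an arbitrary grid of numbers standing in for the losses $L_{D_i}(A(S^i_j))$; the random values here are purely illustrative:

```python
import random

# Hypothetical grid of losses: L[i][j] stands in for L_{D_i}(A(S^i_j)).
random.seed(0)
T, k = 8, 8
L = [[random.random() for _ in range(k)] for _ in range(T)]

row_means = [sum(row) / k for row in L]                          # (1/k) sum_j L[i][j]
col_means = [sum(L[i][j] for i in range(T)) / T for j in range(k)]
grand_mean = sum(row_means) / T                                  # = sum(col_means)/k

assert max(row_means) >= grand_mean   # a maximum is at least the average
assert grand_mean >= min(col_means)   # the average is at least a minimum
```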
Given a training set $S_j$ we can find a set of elements of $C$, which we denote as $v_1, \ldots, v_p$, that are not members of $S_j$. If the training set $S_j$ contained $m$ distinct elements then $p=m$; however, since the training sets are generated with replacement, some elements of $S_j$ may be repeated, in which case $p > m$; in general $p\geq m$. If we choose an arbitrary function $h$ from the set of all possible functions from $C$ to $\{0,1\}$ then we can calculate its generalisation error by using the following expression, $$ L_{D_i}(h)=\frac{1}{2m} \sum_{x \in C} \indic_{h(x) \neq f_i(x)} =\Pr_{x \sim D_i} [h(x) \neq f_i(x)] $$ Since the set $\{v_1,\ldots,v_p\}$ is a subset of $C$ we can write, $$ \frac{1}{2m} \sum_{x \in C} \indic_{h(x) \neq f_i(x)} \geq \frac{1}{2m} \sum_{r =1}^p \indic_{h(v_r) \neq f_i(v_r)} $$ Since $p \geq m$ it is the case that, $$ \frac{1}{2m} \sum_{r =1}^p \indic_{h(v_r) \neq f_i(v_r)} \geq \frac{1}{2p} \sum_{r =1}^p \indic_{h(v_r) \neq f_i(v_r)} $$
If we replace $h$ with $A(S^i_j)$ we have, $$ L_{D_i}(A(S^i_j)) = \frac{1}{2m} \sum_{x \in C} \indic_{A(S^i_j)(x) \neq f_i(x)} \geq \frac{1}{2p} \sum_{r =1}^p \indic_{A(S^i_j)(v_r) \neq f_i(v_r)} $$ Since an average is at least the minimum of the values averaged, we can write, $$ \frac{1}{2p} \sum_{r =1}^p \indic_{A(S^i_j)(v_r) \neq f_i(v_r)} \geq \frac{1}{2} \min_{r \in [p]} \indic_{A(S^i_j)(v_r) \neq f_i(v_r)} $$ and so, $$ L_{D_i}(A(S^i_j)) \geq \frac{1}{2} \min_{r \in [p]} \indic_{A(S^i_j)(v_r) \neq f_i(v_r)} $$ If we average this last expression over the range of all possible functions $[T]$ we obtain, $$ \frac{1}{T} \sum_{i=1}^T L_{D_i}(A(S^i_j)) \geq \frac{1}{2} \min_{r \in [p]} \frac{1}{T} \sum_{i=1}^T \indic_{A(S^i_j)(v_r) \neq f_i(v_r)} $$ The problem is resolving, $$ \min_{r \in [p]} \frac{1}{T}\sum_{i=1}^T \indic_{A(S^i_j)(v_r) \neq f_i(v_r)} $$ which at first sight does not seem to equal anything simple. We have selected a value of $r$ and average over all the possible functions used for labelling element $v_r$. Recall that each function $f_i$ is just a different combination of pairs $(x,y)$; the $T$ functions can be grouped into $T/2$ pairs $(f_i, f_{i'})$ that agree on every element of $C$ except $v_r$ ($T$ is always an even number). Since $v_r$ is not an element of $S_j$, the labelled training sets $S^i_j$ and $S^{i'}_j$ are identical, so $A(S^i_j)=A(S^{i'}_j)$; exactly one function of each pair must then disagree with the prediction $A(S^i_j)(v_r)$. Hence, $$ \frac{1}{T}\sum_{i=1}^T \indic_{A(S^i_j)(v_r) \neq f_i(v_r)} = \frac{1}{T} \cdot \frac{T}{2} =\frac{1}{2} $$ and so we conclude that, $$ \frac{1}{T} \sum_{i=1}^T L_{D_i}(A(S^i_j)) \geq \frac{1}{4} $$
We have already shown that, $$ \max_{i \in [T]} \frac{1}{k} \sum_{j=1}^k L_{D_i}(A(S^i_j)) \geq \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^T L_{D_i}(A(S^i_j)) $$ Since the inequality $\frac{1}{T} \sum_{i=1}^T L_{D_i}(A(S^i_j)) \geq \frac{1}{4}$ holds for every value of $j$, it holds in particular for the minimising $j$, and so we can write, $$ \max_{i \in [T]} \frac{1}{k} \sum_{j=1}^k L_{D_i}(A(S^i_j)) \geq \frac{1}{4} $$ Can we make sense of the expression on the l.h.s.? First, $$ \frac{1}{k} \sum_{j=1}^k L_{D_i}(A(S^i_j)) $$ We know that each training set $S^i_j$ has an equal probability of being generated ($D_i$ assigns equal probability to all $x\in C$). Therefore, $$ \expect_{S \sim D_i^m}[L_{D_i}(A(S))] = \frac{1}{k} \sum_{j=1}^k L_{D_i}(A(S^i_j)) $$ is the expected generalisation error of an algorithm $A$ trained with data generated from the distribution $D_i$.
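Putting it all together, the final bound $\max_i \frac{1}{k}\sum_j L_{D_i}(A(S^i_j)) \geq \frac{1}{4}$ can be verified exhaustively for a small $m$. The memorising learner `A` is again a hypothetical choice of my own; any deterministic algorithm would satisfy the same bound:

```python
from itertools import product

m = 2
C = list(range(2 * m))
functions = list(product([0, 1], repeat=len(C)))    # all T = 2^{2m} labellings f_i
draws = list(product(C, repeat=m))                  # all k = (2m)^m training draws

def A(S):
    """Hypothetical learner: memorise training labels, predict 0 on unseen points."""
    seen = dict(S)
    return lambda x: seen.get(x, 0)

def loss(h, f):
    # L_{D_i}(h) = (1/|C|) * #{x in C : h(x) != f_i(x)}
    return sum(h(x) != f[x] for x in C) / len(C)

# max over i of the expected loss (1/k) sum_j L_{D_i}(A(S^i_j))
max_expected = max(
    sum(loss(A([(x, f[x]) for x in xs]), f) for xs in draws) / len(draws)
    for f in functions
)
assert max_expected >= 0.25
```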
If we then write, $$ \max_{i \in [T]} \expect_{S \sim D_i^m}[L_{D_i}(A(S))] \geq \frac{1}{4} $$ it is equivalent to the statement that there is always some task (some labelling function $f_i$ attaining the maximum) for which, even though the algorithm is calibrated on training data carrying the correct labels, its expected generalisation error is at least 25%. Therefore it is not possible to find an algorithm that performs equally well for all tasks.