Why is a non-uniformly learnable hypothesis class a countable union of uniformly learnable ones?


The question concerns the proof of Theorem 7.2 in Understanding Machine Learning by Shalev-Shwartz & Ben-David. The authors argue in the following way: suppose $\mathcal{H}$ is non-uniformly learnable, that is,

Def: There exist an algorithm $A$ and a function $m_{\mathcal{H}}^{NUL}:(0,1)^2 \times \mathcal{H} \rightarrow \mathbb{N}$ such that for any $\epsilon, \delta\in (0,1)$ and any $h\in\mathcal{H}$, if $m\geq m_{\mathcal{H}}^{NUL}(\epsilon, \delta, h)$, then for any distribution $\mathcal{D}$ over the sample space and any training sample $S$ of size $m$, with probability at least $1-\delta$ we have $$L_{\mathcal{D}}(A(S)) \leq L_{\mathcal{D}}(h) + \epsilon,$$ where $L_{\mathcal{D}}$ denotes the true risk, i.e. the probability of misclassification.
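For contrast (this side-by-side restatement is mine, not from the book), the only difference from uniform (agnostic PAC) learnability is the order of quantifiers: uniformly, one sample size must work for all $h$ simultaneously, whereas here the sample size may depend on the competing hypothesis $h$:

```latex
% Uniform (agnostic PAC): one m(\epsilon,\delta) for all h at once
\forall \epsilon,\delta \;\exists m(\epsilon,\delta) \;\forall \mathcal{D}:\quad
\Pr_{S\sim\mathcal{D}^m}\!\Big[L_{\mathcal{D}}(A(S)) \le \min_{h\in\mathcal{H}} L_{\mathcal{D}}(h) + \epsilon\Big] \ge 1-\delta

% Non-uniform: m may additionally depend on h
\forall \epsilon,\delta \;\forall h\in\mathcal{H} \;\exists m(\epsilon,\delta,h) \;\forall \mathcal{D}:\quad
\Pr_{S\sim\mathcal{D}^m}\!\Big[L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon\Big] \ge 1-\delta
```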

Now, the theorem in question says that the class is non-uniformly learnable if and only if it is a countable union of hypothesis classes, each of finite VC-dimension. The proof in the book defines $$\mathcal{H}_n = \{h\in \mathcal{H}:m_{\mathcal{H}}^{NUL}(1/8, 1/7, h) \leq n\}.$$ Obviously, $\mathcal{H}$ is the union of the $\mathcal{H}_n$ (every $h\in\mathcal{H}$ lies in $\mathcal{H}_n$ for $n = m_{\mathcal{H}}^{NUL}(1/8, 1/7, h)$). However, it is not clear to me why each $\mathcal{H}_n$ has finite VC-dimension. The proof invokes the Fundamental Theorem of Statistical Learning, which provides the lower bound on the training-sample size required for $(\epsilon, \delta)$ accuracy-confidence, but that theorem assumes finite VC-dimension, which is precisely what we want to show. On the surface, the argument therefore appears circular. It is also not clear to me why it would break down if I replaced $(1/8, 1/7)$ with $(1/2, 1)$.

Any insight would be much appreciated!

Answer:

Statement: $VC(\cal{H}_n) \leq 2n$.

Proof:

  1. By contradiction, assume $VC(\cal{H}_n) > 2n$. Then in particular there exists a set of $2n$ points ${\cal Y}^{2n} \subseteq \cal{X}$ shattered by $\cal{H}_n$ (any $2n$-point subset of a larger shattered set is itself shattered).
  2. Since the sample size $n$ is at most half of $|{\cal Y}^{2n}| = 2n$, the no-free-lunch theorem applies: for the algorithm $A$ (from the definition of non-uniform learnability) there exists a distribution ${\cal D}_A$ over ${\cal Y}^{2n}$ for which, with probability at least $1/7$ over the choice of $S \sim {\cal D}_A^n$, it holds that ${\cal L}_{{\cal D}_A}(A(S)) > \frac18$.
  3. Since ${\cal Y}^{2n}$ is shattered by ${\cal {H}}_n$, and the labels under ${\cal D}_A$ are realized by some fixed function on ${\cal Y}^{2n}$, there exists a hypothesis $h' \in {\cal H}_n$ such that ${\cal L}_{{\cal D}_A}(h') = 0$. Because $h' \in {\cal H}_n$, we have $m_{\mathcal{H}}^{NUL}(1/8, 1/7, h') \leq n$, so the definition of non-uniform learnability applied to $h'$ with a sample of size $n$ says, in particular, that with probability at least $6/7$ over the choice of $S \sim {\cal D}_A^n$, it holds that ${\cal L}_{{\cal D}_A}(A(S)) \leq {\cal L}_{{\cal D}_A}(h') + \frac18 = \frac18$, which contradicts the statement in (2).

Two remarks addressing the original question directly. First, there is no circularity: the argument above only uses the no-free-lunch theorem, which holds for any shattered set and does not presuppose finite VC-dimension; the Fundamental Theorem enters the book's proof only as the packaging of this lower bound. Second, the constants $(1/8, 1/7)$ are exactly those appearing in the no-free-lunch theorem; with $(1/2, 1)$ the definition in step (3) would only guarantee ${\cal L}_{{\cal D}_A}(A(S)) \leq \frac12$ with probability at least $1 - 1 = 0$, a vacuous statement that contradicts nothing.
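To make the shattering notion in step (1) concrete, here is a small brute-force check (a sketch of mine, not from the book or the answer; the threshold class is a hypothetical toy example): a set of points is shattered when the hypothesis class realizes every one of the $2^{|C|}$ possible labelings of it.

```python
def shatters(hypotheses, points):
    """Check whether a finite set of hypotheses shatters a finite set of
    points, i.e. realizes every one of the 2^len(points) labelings."""
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

# Hypothetical toy class: thresholds on the line, h_t(x) = 1 iff x >= t.
thresholds = [0.5, 1.5, 2.5, 3.5]
H = [lambda x, t=t: int(x >= t) for t in thresholds]

print(shatters(H, [1.0]))       # True: a single point is shattered
print(shatters(H, [1.0, 2.0]))  # False: the labeling (1, 0) is unrealizable
```

The threshold class shatters every single point but no pair, which is the usual way of seeing that its VC-dimension is 1.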