Is "nonuniform learnability" only a necessary condition for "PAC learnability"?


When reading "Understanding Machine Learning" by Shai Shalev-Shwartz and Shai Ben-David, I became confused about the relationship between nonuniform (NU) learnability and PAC learnability. The notation below follows "Understanding Machine Learning."

To me, NU learnability looks like a sufficient condition for PAC learnability. But it is not: NU learnability is a strict relaxation of PAC learnability, as shown in Example 7.1 on p. 85. I believe this is just a simple confusion on my part.

Here are the definitions of those two learnabilities.

【Nonuniform learnability】 A hypothesis class $\mathcal{H}$ is nonuniformly learnable if there exist a learning algorithm, $A$, and a function $m_{\mathcal{H}}^{\mathrm{NUL}} :(0,1)^{2} \times \mathcal{H} \rightarrow \mathbb{N}$ such that, for every $\epsilon, \delta \in(0,1)$ and for every $h \in \mathcal{H}$, if $m \geq m_{\mathcal{H}}^{\mathrm{NUL}}(\epsilon, \delta, h)$, then for every distribution $\mathcal{D}$, with probability of at least $1-\delta$ over the choice of $S \sim \mathcal{D}^{m}$, it holds that $L_{\mathcal{D}}(A(S)) \leq L_{\mathcal{D}}(h)+\epsilon$.

【PAC learnability】 A hypothesis class $\mathcal{H}$ is PAC learnable if there exist a learning algorithm, $A$, and a function $m_{\mathcal{H}} :(0,1)^{2} \rightarrow \mathbb{N}$ such that, for every $\epsilon, \delta \in(0,1)$ and for every distribution $\mathcal{D}$, if $m\geq m_{\mathcal{H}}(\epsilon, \delta)$, then with probability of at least $1-\delta$ over the choice of $S \sim \mathcal{D}^{m}$, it holds that $L_{\mathcal{D}}(A(S)) \leq \min _{h^{\prime} \in \mathcal{H}} L_{\mathcal{D}}\left(h^{\prime}\right)+\epsilon$.
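To make the difference between the two definitions easier to see, here is the quantifier order of each, restated from the definitions above (this restatement is mine, not from the book):

```latex
% PAC: the sample size is chosen before h and D, depending only on (epsilon, delta).
\exists A \;\; \forall \epsilon,\delta \;\; \exists\, m_{\mathcal{H}}(\epsilon,\delta)
  \;\; \forall \mathcal{D} :\;
  L_{\mathcal{D}}(A(S)) \leq \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon

% NUL: the sample size may additionally depend on the competing hypothesis h,
% but it must still be chosen before the distribution D.
\exists A \;\; \forall \epsilon,\delta \;\; \forall h \in \mathcal{H}
  \;\; \exists\, m_{\mathcal{H}}^{\mathrm{NUL}}(\epsilon,\delta,h)
  \;\; \forall \mathcal{D} :\;
  L_{\mathcal{D}}(A(S)) \leq L_{\mathcal{D}}(h) + \epsilon
```

In both definitions, the quantifier $\forall \mathcal{D}$ comes after the sample size is fixed, so neither $m_{\mathcal{H}}$ nor $m_{\mathcal{H}}^{\mathrm{NUL}}$ may depend on $\mathcal{D}$.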

Here is my proof showing that if $\mathcal{H}$ is NU learnable, it is PAC learnable.

Suppose $\mathcal{H}$ is NU learnable. By definition, for every $h\in \mathcal{H}$ and every $\epsilon, \delta\in(0,1)$, there is a sample size $m_{\mathcal{H}}^{\mathrm{NUL}}(\epsilon,\delta,h)$ beyond which $L_{\mathcal{D}}(A(S)) \leq L_{\mathcal{D}}(h)+\epsilon$ with probability at least $1-\delta$. In particular, this holds for a globally optimal hypothesis $h^*=\operatorname{argmin}_{h\in\mathcal{H}} L_{\mathcal{D}}(h)$. So if we define $m_{\mathcal{H}}(\epsilon, \delta) := m_{\mathcal{H}}^{\mathrm{NUL}}(\epsilon,\delta, h^*)$, we can guarantee $L_{\mathcal{D}}(A(S)) \leq \min _{h^{\prime} \in \mathcal{H}} L_{\mathcal{D}}\left(h^{\prime}\right)+\epsilon$, because $L_{\mathcal{D}}\left(h^{*}\right) = \min _{h^{\prime} \in \mathcal{H}} L_{\mathcal{D}}\left(h^{\prime}\right)$. This means $\mathcal{H}$ is PAC learnable. (End of proof.)

What is the problem with this proof?

1 Answer

Your confusion comes from the definition of $h^*$. PAC learnability (like nonuniform learnability) requires the guarantee to hold for every distribution $\mathcal{D}$, with a sample size fixed before $\mathcal{D}$ is chosen. However, your $m_{\mathcal{H}}(\epsilon, \delta)$ depends on $\mathcal{D}$, because $h^*$ changes depending on $\mathcal{D}$. So your $m_{\mathcal{H}}(\epsilon, \delta)$ is not a valid sample-complexity function for PAC learnability, and the proof fails.
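For intuition on why no $\mathcal{D}$-free bound can exist in general, here is a sketch of a standard separating class of the kind Example 7.1 (cited in the question) discusses; the specific choice of polynomial classifiers here is my recollection of that example, so treat the details as an illustration rather than a quotation:

```latex
% A countable union of nested classes, each of finite VC dimension:
\mathcal{H}_n = \{\, x \mapsto \operatorname{sign}(p(x)) : p \text{ is a polynomial of degree} \le n \,\},
\qquad
\mathcal{H} = \bigcup_{n=1}^{\infty} \mathcal{H}_n .

% Each H_n has finite VC dimension, so each H_n is (agnostic) PAC learnable,
% and H is nonuniformly learnable (e.g. via Structural Risk Minimization):
% the required sample size m(epsilon, delta, h) grows with the index n of the
% smallest H_n containing h.

% But VCdim(H) = infinity, so by the fundamental theorem of statistical
% learning, H is not PAC learnable: no single m(epsilon, delta) works for
% every h (equivalently, for every D) simultaneously.
```

The failure mode matches the answer above: for a distribution $\mathcal{D}$ whose optimal $h^*$ lies in an $\mathcal{H}_n$ with very large $n$, the sample size $m_{\mathcal{H}}^{\mathrm{NUL}}(\epsilon,\delta,h^*)$ is very large, and no $\mathcal{D}$-independent $m_{\mathcal{H}}(\epsilon,\delta)$ can dominate all of these at once.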