When will empirical risk minimization with inductive bias fail?


I am working on an assignment and I am stuck on this problem:

Give an example of a hypothesis class $H$, a domain space $X$, a distribution $P$ over $X \times \{0,1\}$, and an $ERM_H$ learning algorithm $A$, such that for some $h^{*} \in H$ and for every sample size $m$, $h^{*}$ is a much better hypothesis than the ERM one picked by $A$. That is, $\mathbb{E}[L_{P}(A(S))] \ge L_{P}(h^{*}) + 1/2$.

I have no idea how to do that. Can anybody give me some hints?

Accepted answer:

The first observation is that if $H$ has finite VC dimension, then the empirical risk converges to the true risk as the sample size grows, so ERM produces better and better hypotheses. This suggests looking for a hypothesis class with infinite VC dimension. As for the distribution, to make things easy for us and hard for ERM, a uniform distribution is a natural choice.

An example that I think works is the following:

  1. $X = [0,1]$
  2. $H$ is the set of all functions $h:X \rightarrow \{0,1\}$. $H$ clearly has infinite VC dimension.
  3. $c \in H$ is some arbitrary fixed function.
  4. $P$ is the distribution over $X \times \{0,1\}$ induced by the uniform distribution over $X$ together with the labeling $c$.

Since $c \in H$ we can take $h^{*} = c$, so the true risk is $L_{P}(h^{*}) = 0$.

The sample is a finite set, so its support always has measure $0$ in $X$. The algorithm $A$ will pick a hypothesis with $0$ empirical risk, but since it knows nothing about the values of $c$ on the rest of the domain, it cannot do better than random guessing there, and its true risk will be $1/2$.
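This behavior is easy to see in a small simulation (a sketch with names of my own choosing, not from the original post): I stand in for the fixed target $c$ with a deterministic "pseudorandom" labeling of $[0,1]$, and the ERM learner memorizes the sample and predicts $0$ elsewhere.

```python
import random

def c(x):
    # Hypothetical fixed target: the parity of floor(x * 10^9),
    # which behaves like a fair coin for uniform x in [0,1].
    return int(x * 10**9) % 2

def erm_memorizer(sample):
    """An ERM learner A: memorize the sample, predict 0 everywhere else.

    Its empirical risk is 0 on any sample labeled by c, but it learns
    nothing about c outside the (measure-zero) set of training points.
    """
    table = {x: y for x, y in sample}
    return lambda x: table.get(x, 0)

def true_risk(h, n=100_000, rng=random.Random(0)):
    # Monte Carlo estimate of L_P(h) under the uniform distribution on [0,1].
    return sum(h(x) != c(x) for x in (rng.random() for _ in range(n))) / n

rng = random.Random(1)
sample = [(x, c(x)) for x in (rng.random() for _ in range(1000))]
h = erm_memorizer(sample)

empirical = sum(h(x) != y for x, y in sample) / len(sample)
print(empirical)       # 0.0: a perfect fit on the sample
print(true_risk(h))    # roughly 0.5: no better than guessing off-sample
```

Fresh uniform test points almost surely never coincide with training points, so the memorizer's true risk stays near $1/2$ no matter how large the sample is, while $h^{*} = c$ has true risk $0$.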

Let $P_{H}$ be the uniform distribution over $H$: for every $x \in X$ and $f \sim P_{H}$, $P(f(x)=0) = P(f(x)=1) = 1/2$. For a sample $S$ we can define a new distribution $P_{HS}$ by conditioning on the values on the sample: $f(x_{i}) = c(x_{i})$ for all $x_{i} \in S$. Combined with the fact that $\mathrm{support}(S)$ has measure $0$, this gives $\mathbb{E}_{x\sim P}\big[\mathbb{E}_{h\sim P_{HS}}[\mathbb{1}[A(S)(x) \ne h(x)]]\big] = 1/2$. Therefore there must exist at least one hypothesis $h$ that agrees with the sample labels but disagrees with $A(S)$ on enough of the rest of the domain $X$. Letting $c$ be this hypothesis, we get $L_{P}(A(S)) = 1/2$.
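The averaging argument can also be written as one short chain of equalities (my own spelling-out of the step above, in the notation already introduced):

$$\mathbb{E}_{c \sim P_{H}}\,\mathbb{E}_{S \sim P^{m}}\big[L_{P}(A(S))\big] \;=\; \mathbb{E}_{S}\,\mathbb{E}_{x \sim P}\,\Pr_{c \sim P_{HS}}\big[A(S)(x) \neq c(x)\big] \;=\; \frac{1}{2},$$

since for any $x$ outside the measure-zero support of $S$ the label $c(x)$ is an independent fair coin, whatever $A(S)$ predicts. A quantity with average $1/2$ must attain at least $1/2$ somewhere, so there exists a fixed $c$ with $\mathbb{E}_{S}[L_{P}(A(S))] \ge 1/2$, while $L_{P}(h^{*}) = L_{P}(c) = 0$.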