Minimizer of $E[(1-Yf(x))_+]$, where $Y \in \{-1,1\}$


My goal is to prove the following statement:

Let $Y \in \{-1,1\}$ and $p(x) = P(Y=1\mid X=x)$, and let $z_+ = zI(z \ge 0)$. Show that the minimizer $\hat f$ of $E[(1-Yf(X))_+\mid X=x]$ satisfies $\text{sign}(\hat f) = \text{sign}(p(x) - 1/2)$.


Let $f = f(x)$ and $p = p(x)$ for simplicity.

I tried splitting into four cases, according to the signs of $1-f$ and $1+f$.

In the first case $1-f \ge 0$ and $ 1+f \ge 0$, $$E[(1-Yf)_+] = (1-f)p + (1+f)(1-p) = 1+(1-2p)f$$ When $1-2p \le 0$, I claimed that $f=1$ minimizes the expectation, and when $1-2p \ge 0$, $f = -1$ minimizes it.

(I'm not sure whether this approach is correct; it seems $f$ is regarded as constant, which is not guaranteed...)

Also, in the second case, $1-f \ge 0$ and $1+f \le 0$ (i.e. $f \le -1$), we get $$E[(1-Yf)_+] = (1-f)p.$$ Based on my approach above, the expectation would be minimized at $f = -1$ regardless of the sign of $(p-1/2)$.
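As a numeric sanity check of the case analysis (not part of the proof), one can evaluate the conditional risk directly; the helper name `hinge_risk` is mine:

```python
import numpy as np

def hinge_risk(f, p):
    """Conditional risk E[(1 - Y f)_+ | X = x] when P(Y = 1 | X = x) = p."""
    return p * max(1.0 - f, 0.0) + (1 - p) * max(1.0 + f, 0.0)

# Case 1: for f in [-1, 1] the risk equals 1 + (1 - 2p) f.
for p in (0.2, 0.5, 0.8):
    for f in np.linspace(-1, 1, 21):
        assert abs(hinge_risk(f, p) - (1 + (1 - 2 * p) * f)) < 1e-12

# Case 2: for f <= -1 the risk equals (1 - f) p, so within this region
# it is smallest at the boundary f = -1 (value 2p).
for p in (0.2, 0.8):
    assert abs(hinge_risk(-3.0, p) - 4 * p) < 1e-12
    assert hinge_risk(-1.0, p) <= hinge_risk(-3.0, p)
```

Note the second case is not actually strange: its minimum sits on the boundary $f=-1$, which is shared with the first case, so the two cases agree there.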

This seems strange, so I'm stuck here.
In particular, I'm not sure how to use $z_+$ properly in this case.

Any help with respect to my problem would be appreciated. Thank you.


Edit (continued, in response to the answer):

I assume you intended to show that when $f \ge 1$, $Y = \text{sign}(f)$ leads to $1-Yf = 1-f \le 0$ and thus $(1-Yf)_+ = 0$, while $Y=-\text{sign}(f)$ leads to $1-Yf \ge 2$.

Now everything is clear, except for the w.l.o.g. step restricting $f \in [-1,1]$. How does the above fact justify assuming $f \in [-1,1]$? I mean, what about candidates with $|f(x)|$ arbitrarily large?

Answer:

First, some intuition about the problem: notice that $f(x)$ is $\{-1,1\}$-valued, and $Yf(X)$ equals $1$ exactly when $Y=f(X)$, while it equals $-1$ when $Y\ne f(X)$. Hence the expression $(1-Yf(X))_+$ equals $0$ when $f(X)$ correctly predicts $Y$, and equals $2$ when $f$ "makes a mistake".

In other words, you're looking at a binary classification problem with the hinge loss $\ell : \tilde f(x)\mapsto \mathbb E\left[(1- Y\tilde f(X))_+\mid X =x\right] $ as a surrogate. And you're trying to show that, for any given $X=x$, the best possible decision rule is given by $\tilde f(x) \equiv f(x)$. Indeed, $f$ as you've defined it is known as the Bayes classifier, which is known to be optimal for the so-called 0-1 loss $ \ell_{01} : \tilde f(x)\mapsto \Pr\left\{\tilde f(X) \ne Y\mid X=x\right\}$.
If you think about it, this should make sense: $f$ essentially picks the class with the greatest probability of occurrence given $X=x$, so if that is all the available information, we expect it to be the best possible decision to make.
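This intuition can be checked mechanically; a minimal sketch (the loop is mine, not from the answer):

```python
# For {-1, 1}-valued predictions, the hinge term (1 - y*f)_+ is 0 on a
# correct prediction and 2 on a mistake, i.e. twice the 0-1 loss.
for y in (-1, 1):
    for f in (-1, 1):
        hinge = max(1 - y * f, 0)
        zero_one = 0 if f == y else 1
        assert hinge == 2 * zero_one
```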


Now for the proof that $f$ is optimal for the hinge loss $\ell$, we can proceed as follows: first notice that any candidate $\tilde f :\mathcal X \to \mathbb R$ (here, $\mathcal X$ denotes the support of $X$) can be restricted w.l.o.g. to take values in the interval $[-1,1]$. Indeed, if, say, $\tilde f(x)\ge 1$, then either $Y=\text{sign}\big(\tilde f(x)\big)$, in which case $1-Y\tilde f(x) \le 0$ and so $\big(1-Y\tilde f(x)\big)_+ = 0$ (i.e. $\tilde f(x)$ and $\tilde f(x) \wedge 1$ have the same loss, namely $0$), or $Y=-\text{sign}\big(\tilde f(x)\big)$, in which case $1-Y\tilde f(x) = 1+\tilde f(x) \ge 2 = 1 - Y\big(\tilde f(x) \wedge 1\big)$ (i.e. replacing $\tilde f(x)$ by $\tilde f(x) \wedge 1$ can only decrease the loss). The case $\tilde f(x)\le -1$ is handled similarly. In particular, candidates with $|\tilde f(x)|$ arbitrarily large are never strictly better than their clipped versions.
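The w.l.o.g. reduction can likewise be spot-checked numerically: clipping a candidate value into $[-1,1]$ never increases the conditional risk. A small sketch (the function names are mine):

```python
import numpy as np

def hinge_risk(f, p):
    """Conditional risk E[(1 - Y f)_+ | X = x] when P(Y = 1 | X = x) = p."""
    return p * max(1.0 - f, 0.0) + (1 - p) * max(1.0 + f, 0.0)

def clip(f):
    """Project f onto [-1, 1], i.e. take the min with 1 and the max with -1."""
    return max(-1.0, min(1.0, f))

# Clipping never increases the risk, for any p and any candidate value.
for p in np.linspace(0.0, 1.0, 11):
    for f in np.linspace(-5.0, 5.0, 101):
        assert hinge_risk(clip(f), p) <= hinge_risk(f, p) + 1e-12
```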

Now let $\tilde f :\mathcal X \to [-1,1]$ be any function. Observe that $1- Y\tilde f(X)$ is then always non-negative, so we can drop the $(\cdot)_+$ around it, hence $$\mathbb E\left[(1- Y\tilde f(X))_+\mid X =x\right] = \mathbb E\left[1- Y\tilde f(X)\mid X =x\right] = 1 - \mathbb E\left[Y\tilde f(X)\mid X =x\right]. $$ So to minimize $\ell$ it is enough to maximize $ \mathbb E\left[Y\tilde f(X)\mid X =x\right]$. But by the (conditional) law of total expectation we have \begin{align} \mathbb E\left[Y\tilde f(X)\mid X =x\right] &= \tilde f(x)\Pr(Y=1\mid X = x) - \tilde f(x)\Pr(Y=-1\mid X = x)\\ &= \tilde f(x)\big(p(x) - (1-p(x))\big)\\ &= \tilde f(x)(2p(x) - 1) \le \left|\tilde f(x)\right||2p(x) - 1|. \end{align}

Furthermore, equality is achieved precisely when $\text{sign}\left(\tilde f(x)\right) = \text{sign}(2p(x) -1)$ (or one of the two factors vanishes), and the bound $\left|\tilde f(x)\right||2p(x) - 1|$ is maximized over $\tilde f(x)\in[-1,1]$ by taking $\left|\tilde f(x)\right| = 1$. Hence the minimizer is $\hat f(x) = \text{sign}(2p(x)-1)$, and in particular $\text{sign}(\hat f) = \text{sign}(p(x) -1/2)$, which is exactly what we wanted to show.
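Finally, the conclusion can be verified by a brute-force grid search over $\tilde f(x)\in[-1,1]$; a sketch under the assumption $p \ne 1/2$ (at $p = 1/2$ every $\tilde f(x)$ gives the same risk, so the minimizer is not unique):

```python
import numpy as np

def hinge_risk(f, p):
    """Conditional risk E[(1 - Y f)_+ | X = x] when P(Y = 1 | X = x) = p."""
    return p * max(1.0 - f, 0.0) + (1 - p) * max(1.0 + f, 0.0)

grid = np.linspace(-1.0, 1.0, 2001)
for p in (0.05, 0.3, 0.49, 0.51, 0.7, 0.95):
    f_hat = min(grid, key=lambda f: hinge_risk(f, p))
    # The minimizer sits at an endpoint and its sign matches sign(p - 1/2).
    assert abs(f_hat) == 1.0
    assert np.sign(f_hat) == np.sign(p - 0.5)
```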