What is the optimal classification rule among all rules that are functions of $\|x\|^2$?


I have a textbook problem which I'm not quite sure how to solve:

Suppose that you observe $(x_1,y_1),\dots,(x_{100}, y_{100})$, which you assume to be i.i.d. copies of a random pair $(x, y)$ taking values in $\mathbb{R}^2 \times \{1,2\}$. Further, suppose that you observe $x$, and you would like to predict $y$. The data looks like the following: (figure not shown). Given that the data is rotationally symmetric and $\operatorname{Pr}(y=1)=\tfrac{1}{2}$, and

$$ \big[\|x\|^2\mid y=1\big] \sim \text{Exp}(\tfrac{1}{2}) \qquad\text{and}\qquad \big[\|x\|^2\mid y=2\big] \sim \text{Unif}([9,16])$$

What is the optimal classification rule among all rules that are functions of $\|x\|^2$?

Also, how do I show that the expected cost of this classification rule is equal to $\frac{1}{2}(e^{-9/2}-e^{-8})$? The misclassification costs are equal to $c_1=c_2=1$.

So, in the textbook I'm using, the expected cost of misclassification is defined as:

Suppose we use the classification rule $g:\mathbb{R}^p\rightarrow \{1,2\}$ that assigns to group $1$ when $x \in R_1$ and to group $2$ when $x\in R_2$. The expected cost of misclassification associated with the rule $g$ is $$\mathbb{E}[\text{cost}(Y,g(X))]=c_2\mathbb{P}(x\in R_1 | Y=2)\pi_2+c_1\mathbb{P}(x\in R_2 | Y=1)\pi_1$$ where $\pi_1=\mathbb{P}(Y=1|x)$ and $\pi_2=\mathbb{P}(Y=2|x)$

My attempt: We have that

$$\begin{aligned} R_1:&=\{x: \int_{|X|^2|Y=1}(x|y=1)dx\gt \int_{|X|^2=2}(x|y=2)dx\} \\ &=\{x: \tfrac{1}{7} \gt \tfrac{1}{2}e^{\frac{1}{2}x}\} \\ &=\{x: \log(\tfrac{49}{4})\lt x\} \\ R_2:&=\{x: \log(\tfrac{49}{4}) \gt x\} \end{aligned}$$

So,

$$ \mathbb{E}[\text{cost}(y, g(x))] =\tfrac{1}{2}\int_{R_1}\int_{|X|^2|Y=2}(x)dx+\tfrac{1}{2}\int^{R_2}\int_{|X|^2|Y=1}(x)dx $$

The second integrand is equal to $0$, and the first integrand is equal to:

$$ \tfrac{1}{2}\int^{16}_9\frac{e^{-\frac{1}{2}x}}{2}dx =\tfrac{1}{2}[e^{-\frac{-1}{2}x}]^{16}_9=\frac{e^{\frac{9}{2}}-e^8}{2} $$

Would this be correct?

Best answer

Your $R_1$ and $R_2$ are definitely not correct, and there are overall a lot of problems with your attempt:

  1. $\{x: \tfrac{1}{7} \gt \tfrac{1}{2}e^{\frac{1}{2}x}\}$ and $\{x: \log(\tfrac{49}{4})\lt x\}$ are nonsensical statements since $x\in\mathbb R^2$. You probably mean $\|x\|^2$ instead, since that is the quantity whose conditional distributions are given?
  2. Even then, the statement $\{x: \tfrac{1}{7} \gt \tfrac{1}{2}e^{\frac{1}{2}\|x\|^2}\}=\{x: \log(\tfrac{49}{4})\lt \|x\|^2\}$ is wrong; you miscalculated there (and the exponent in the $\text{Exp}(\tfrac{1}{2})$ density should be $-\tfrac{1}{2}\|x\|^2$). This should be obvious, since $\log(49/4)\approx 2.5$, whereas the decision boundaries should quite obviously be at $\|x\|=3$ and $\|x\|=4$, i.e. at $\|x\|^2=9$ and $\|x\|^2=16$.
  3. The statement $\mathbb{E}[\text{cost}(y, g(x))] =\tfrac{1}{2}\int_{R_1}\int_{|X|^2|Y=2}(x)dx+\tfrac{1}{2}\int^{R_2}\int_{|X|^2|Y=1}(x)dx$ does not make any sense semantically. What are you integrating? What are you integrating over?
  4. How do you get from $\tfrac{1}{2}\int_{R_1}\int_{|X|^2|Y=2}(x)dx$ to $\tfrac{1}{2}\int^{16}_9\frac{e^{-\frac{1}{2}x}}{2}dx$ ?? Where did the $\log(\tfrac{49}{4})$ go??
  5. While in the very end you somehow get the right numbers (up to the signs of the exponents), it is not clear what your argument is for your choice of $R_1$ and $R_2$ being optimal, or why no cost below $\tfrac{1}{2}(e^{-\frac{9}{2}}-e^{-8})$ is achievable.
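If you want a quick sanity check on points 2 and 5, here is a minimal Python sketch. It assumes that $\text{Exp}(\tfrac{1}{2})$ is parametrized by its rate, so the density of $\|x\|^2$ given $y=1$ is $\tfrac{1}{2}e^{-t/2}$, and that with equal priors and equal costs the rule simply picks the class with the larger conditional density at $t=\|x\|^2$. The function names (`density_1`, `classify`, `exact_cost`, `simulated_cost`) are made up for illustration.

```python
import math
import random

RATE = 0.5          # assumption: Exp(1/2) means rate 1/2, density (1/2) e^{-t/2}
LO, HI = 9.0, 16.0  # support of Unif([9, 16])


def density_1(t):
    """Density of t = ||x||^2 given y = 1."""
    return RATE * math.exp(-RATE * t) if t >= 0 else 0.0


def density_2(t):
    """Density of t = ||x||^2 given y = 2."""
    return 1.0 / (HI - LO) if LO <= t <= HI else 0.0


def classify(t):
    """Equal priors and costs: pick the class with the larger density at t."""
    return 2 if density_2(t) > density_1(t) else 1


def exact_cost():
    """Only class-1 points with t in [9, 16] are misclassified by this rule."""
    return 0.5 * (math.exp(-RATE * LO) - math.exp(-RATE * HI))


def simulated_cost(n=200_000, seed=0):
    """Monte Carlo estimate of the misclassification probability."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 2
        t = rng.expovariate(RATE) if y == 1 else rng.uniform(LO, HI)
        errors += classify(t) != y
    return errors / n


print("exact cost:    ", exact_cost())
print("simulated cost:", simulated_cost())
```

On $[9,16]$ the uniform density $\tfrac{1}{7}$ exceeds $\tfrac{1}{2}e^{-t/2}$ everywhere, so the rule assigns group $2$ exactly when $9\le\|x\|^2\le 16$, and the simulated cost should land close to $\tfrac{1}{2}(e^{-9/2}-e^{-8})$. This is only a numerical check, not a proof of optimality.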

I would recommend the following:

  1. Try to write everything down in a clean fashion
  2. The problem is rotationally symmetric $\leadsto$ express $x$ in polar coordinates