In this machine learning lecture the professor says:
Suppose $\mathbf{X}\in\Bbb R^p$ and $g\in G$ where $G$ is a discrete space. We have a joint probability distribution $\Pr(\mathbf{X},g)$.
Our training data has some points like:
$(\mathbf{x_1},g_1)$, $(\mathbf{x_2},g_2)$, $(\mathbf{x_3},g_3)$ ... $(\mathbf{x_n},g_n)$
We now define a classifier $f:\Bbb R^p \to G$.
The loss $L$ is defined as a $K\times K$ matrix, where $K$ is the cardinality of $G$, with zeros along the main diagonal (a correct classification costs nothing).
$L(k,l)$ is the cost of classifying a point whose true class is $k$ as class $l$.
An example of the $0$-$1$ loss function for $K=3$:
$$\begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$
$\text{EPE}(\hat{f}) = \text{E} [L(G,\hat{f}(\mathbf{X}))]$ (where $\text{EPE = Expected Prediction Error}$)
$=E_\mathbf{X} E_{G|\mathbf{X}} \{L[G,\hat{f}(\mathbf{X})]\,|\,\mathbf{X}\}$
$\hat{f}(\mathbf{x})=\text{argmin}_g\sum_{k=1}^{K}L(k,g)\text{Pr}(k|\mathbf{X}=\mathbf{x})$, which under $0$-$1$ loss reduces to $\text{argmax}_g\text{Pr}(g|\mathbf{X=x})$
$\hat{f}(\mathbf{x})$ is the Bayes optimal classifier.
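To make the $\text{argmin}$ concrete, here is my own small sketch (not from the lecture; the posterior numbers are made up) of how the loss matrix and a posterior $\text{Pr}(k|\mathbf{X}=\mathbf{x})$ combine into a prediction:

```python
import numpy as np

# 0-1 loss matrix for K = 3 classes: L[k, l] is the cost of
# predicting class l when the true class is k (zeros on the diagonal).
L = np.ones((3, 3)) - np.eye(3)

def bayes_classifier(posterior, loss):
    """Return argmin_g sum_k loss[k, g] * posterior[k]."""
    expected_loss = posterior @ loss  # expected cost of each possible prediction g
    return int(np.argmin(expected_loss))

# Hypothetical posterior Pr(k | X = x) for some input x:
posterior = np.array([0.2, 0.5, 0.3])
print(bayes_classifier(posterior, L))  # -> 1, i.e. simply argmax of the posterior
```

With $0$-$1$ loss the expected cost of guessing $g$ is $1-\text{Pr}(g|\mathbf{x})$, so the argmin over columns is exactly the argmax of the posterior; with an asymmetric loss matrix the two can differ.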
I couldn't really follow what the professor was trying to say in some of the steps.
My questions are:
Suppose our loss matrix is indeed $\begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$. What is the use of this matrix? What does classifying $k$ as $l$ even mean? And how do we read off the loss for (say) a certain input $\mathbf{x_i}$ from the matrix?
I couldn't understand what $\hat{f}$ and $\text{EPE}(\hat{f}(\mathbf{x}))$ stand for. Could someone please explain it with a simple example?
Consider a random variable $X$ and a random variable $g$. $X$ is uniformly distributed over $[0,1]$ if $g=1$ and uniformly distributed over $[0.5,1.5]$ if $g=2$. $g$ takes the value $1$ with probability $0.1$. This specifies the joint distribution of $(X,g)$.
Assume that we have to guess $g$ after having observed $X$. If we miss, we pay $1$ rupee; if we don't, the penalty is $0$.
It is clear that if $X$ is below $0.5$ then our guess for $g$ is $1$ (only $g=1$ can produce such an $X$). If $X$ is above $1$ then our guess for $g$ is $2$. But what do we do if $X$ falls between $0.5$ and $1$?
Now, define $\hat f $ the following way: let it be $1$ if $X <0.7$ and $2$ otherwise.
Could we do that any better?
Can you identify the current loss matrix?
Can you compute the expected loss belonging to the method given above?
Can you create a similar problem so that the loss matrix is the one given in the OP?
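If you want to check your answers numerically, here is a quick Monte Carlo sketch of the setup above (my own construction, not part of the lecture): it samples $(X,g)$ from the joint distribution and estimates the expected loss of a threshold rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Sample (X, g) from the joint distribution described above.
g = np.where(rng.random(n) < 0.1, 1, 2)           # Pr(g = 1) = 0.1
x = np.where(g == 1,
             rng.uniform(0.0, 1.0, n),            # X | g=1 ~ U[0, 1]
             rng.uniform(0.5, 1.5, n))            # X | g=2 ~ U[0.5, 1.5]

def expected_loss(threshold):
    """Estimate the expected 0-1 loss of: guess 1 if X < threshold, else 2."""
    guess = np.where(x < threshold, 1, 2)
    return np.mean(guess != g)  # we pay 1 rupee on every miss

print(expected_loss(0.7))  # the rule f-hat above
print(expected_loss(0.5))  # guess 2 on the whole overlap region
```

Comparing the two printed estimates against your hand calculation ($0.1\cdot\Pr(X\ge t\mid g=1)+0.9\cdot\Pr(X<t\mid g=2)$ for a threshold $t$) should tell you whether the $0.7$ rule can be improved.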