In this machine learning lecture the professor says:
Suppose $\mathbf{X}\in\Bbb R^p$ and $g\in G$ where $G$ is a discrete space. We have a joint probability distribution $\Pr(\mathbf{X},g)$.
Our training data has some points like:
$(\mathbf{x_1},g_1)$, $(\mathbf{x_2},g_2)$, $(\mathbf{x_3},g_3)$ ... $(\mathbf{x_n},g_n)$
We now define a classifier $f:\Bbb R^p \to G$.
The loss $L$ is defined as a $K\times K$ matrix, where $K$ is the cardinality of $G$, with zeros along the main diagonal (a correct classification costs nothing).
$L(k,l)$ is the cost of classifying a point whose true class is $k$ as class $l$.
An example of the $0$-$1$ loss function for $K=3$:
$$\begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$
$\text{EPE}(\hat{f}) = \text{E} [L(G,\hat{f}(\mathbf{X}))]$ (where $\text{EPE = Expected Prediction Error}$)
$=E_\mathbf{X} E_{G|\mathbf{X}} \{L[G,\hat{f}(\mathbf{X})]\,|\,\mathbf{X}\}$
$\hat{f}(\mathbf{x})=\text{argmin}_g\sum_{k=1}^{K}L(k,g)\text{Pr}(k|\mathbf{X}=\mathbf{x})$, which under $0$-$1$ loss reduces to $\text{argmax}_g\text{Pr}(g|\mathbf{X=x})$
$\hat{f}(\mathbf{x})$ is the Bayes optimal classifier.
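To make the $\text{argmin}$ concrete, here is my own small sketch (not from the lecture; the posterior numbers are made up) of how the loss matrix and a posterior $\text{Pr}(k|\mathbf{X}=\mathbf{x})$ combine into a prediction:

```python
import numpy as np

# 0-1 loss matrix for K = 3 classes: L[k, l] is the cost of
# predicting class l when the true class is k (zeros on the diagonal).
L = np.ones((3, 3)) - np.eye(3)

def bayes_classifier(posterior, loss):
    """Return argmin_g sum_k loss[k, g] * posterior[k]."""
    expected_loss = posterior @ loss  # expected cost of each possible prediction g
    return int(np.argmin(expected_loss))

# Hypothetical posterior Pr(k | X = x) for some input x:
posterior = np.array([0.2, 0.5, 0.3])
print(bayes_classifier(posterior, L))  # -> 1, i.e. simply argmax of the posterior
```

With $0$-$1$ loss the expected cost of guessing $g$ is $1-\text{Pr}(g|\mathbf{x})$, so the argmin over columns is exactly the argmax of the posterior; with an asymmetric loss matrix the two can differ.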
I couldn't really follow what the professor was trying to say in some of the steps.
My questions are:
Suppose our loss matrix is indeed $\begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$. What is the use of this matrix? What does classifying $k$ as $l$ even mean? And how do we read off the loss for (say) a certain input $\mathbf{x_i}$ from the matrix?
I couldn't understand what $\hat{f}$ and $\text{EPE}(\hat{f}(\mathbf{x}))$ stand for. Could someone please explain it with a simple example?
Consider a random variable $X$ and a random variable $g$. $X$ is uniformly distributed over $[0,1]$ if $g=1$ and uniformly distributed over $[0.5,1.5]$ if $g=2$. $g$ takes the value $1$ with probability $0.1$. This specifies the joint distribution of $(X,g)$.
Assume that we have to guess $g$ after having observed $X$. If we miss, we pay $1$ rupee; if we don't, the penalty is $0$.
It is clear that if $X$ is below $0.5$ then our guess for $g$ is $1$ (only $g=1$ can produce such an $X$). If $X$ is above $1$ then our guess for $g$ is $2$. But what do we do if $X$ falls between $0.5$ and $1$?
Now, define $\hat f $ the following way: let it be $1$ if $X <0.7$ and $2$ otherwise.
Could we do that any better?
Can you identify the current loss matrix?
Can you compute the expected loss belonging to the method given above?
Can you create a similar problem so that the loss matrix is the one given in the OP?
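If you want to check your answers numerically, here is a quick Monte Carlo sketch of the setup above (my own construction, not part of the lecture): it samples $(X,g)$ from the joint distribution and estimates the expected loss of a threshold rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Sample (X, g) from the joint distribution described above.
g = np.where(rng.random(n) < 0.1, 1, 2)           # Pr(g = 1) = 0.1
x = np.where(g == 1,
             rng.uniform(0.0, 1.0, n),            # X | g=1 ~ U[0, 1]
             rng.uniform(0.5, 1.5, n))            # X | g=2 ~ U[0.5, 1.5]

def expected_loss(threshold):
    """Estimate the expected 0-1 loss of: guess 1 if X < threshold, else 2."""
    guess = np.where(x < threshold, 1, 2)
    return np.mean(guess != g)  # we pay 1 rupee on every miss

print(expected_loss(0.7))  # the rule f-hat above
print(expected_loss(0.5))  # guess 2 on the whole overlap region
```

Comparing the two printed estimates against your hand calculation ($0.1\cdot\Pr(X\ge t\mid g=1)+0.9\cdot\Pr(X<t\mid g=2)$ for a threshold $t$) should tell you whether the $0.7$ rule can be improved.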