Bayesian classifier and linear regression on dummy variables


EDIT: I am no longer so sure about one thing: they say regression, not specifically linear regression, so I may have misunderstood the whole paragraph.

In the book The Elements of Statistical Learning (Hastie, Tibshirani, Friedman), on page 22, just before Section 2.5, the authors say that, for the task of classification, the Bayes classifier and classification by largest fitted value of a linear regression on dummy variables give the same result. Can you tell me what the precise statement is and where to find a proof?

Let me give you a few details on the context, in order to make the question almost self-contained and to make sure I'm understanding the text correctly.

Let $p$ be an integer, $\mathcal{G}$ be a finite set, and $\mu$ be a probability measure on $\mathbb{R}^p \times \mathcal{G}$ that, for simplicity, we assume to be finitely supported. Let $A$ be the image of the support of $\mu$ under the first coordinate map.

The Bayes classifier is a map $\hat{G} : \mathbb{R}^p \rightarrow \mathcal{G}$ such that for all $x \in A$, $\hat{G}(x) \in \operatorname{argmax}_{g \in \mathcal{G}} \mu(\{(x,g)\})$.
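To fix ideas: when $\mu$ is finitely supported, the Bayes classifier is easy to compute directly. A minimal sketch in Python, where the toy measure on $\mathbb{R}^1 \times \{a, b\}$ and its masses are made up for illustration:

```python
# Toy finitely-supported measure mu on R^1 x {a, b}: (x, g) -> mass.
# The specific support points and masses are illustrative assumptions.
mu = {
    (0.0, "a"): 0.3, (0.0, "b"): 0.1,
    (1.0, "a"): 0.2, (1.0, "b"): 0.4,
}
labels = ["a", "b"]

def bayes_classifier(x):
    # argmax over g of mu({(x, g)}); at fixed x this is the same as
    # maximizing the conditional probability of g given x.
    return max(labels, key=lambda g: mu.get((x, g), 0.0))

print(bayes_classifier(0.0))  # "a", since 0.3 > 0.1
print(bayes_classifier(1.0))  # "b", since 0.4 > 0.2
```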

Classification by largest fitted value of the linear regression on dummy variables is defined in a slightly more involved way: we consider a linear map $\hat{\beta} : \mathbb{R}^{p+1} \rightarrow \mathbb{R}^{\mathcal{G}}$ that minimizes the functional $\beta \mapsto EPE(\beta) := \int_{\mathbb{R}^p \times \mathcal{G}} \sum_{g' \in \mathcal{G}}\left(\mathbf{1}_{g' = g} - \beta(x,1)(g')\right)^2 \, d\mu(x,g)$, and we consider a map $\hat{G}$ such that for all $x \in A$, $\hat{G}(x) \in \operatorname{argmax}_{g \in \mathcal{G}} \hat{\beta}(x,1)(g)$.

Is it true (apart from the possible non-uniqueness of the two $\operatorname{argmax}$ and of the $\operatorname{argmin}$ of $EPE$) that the two maps $\hat{G}$ are equal? Is it true that for all $g \in \mathcal{G}$ and all $x \in A$, $\mu(\{(x,g)\}) = \hat{\beta}(x,1)(g)$?
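As a numerical sanity check of the question (not a proof), here is a sketch in Python comparing the two classifiers on a toy finitely-supported measure; the support points, masses, and helper names are assumptions of the example. The dummy-variable regression is solved as a weighted least-squares problem, which is the discrete form of the $EPE$ functional:

```python
import numpy as np

# Toy finitely-supported measure mu on R^1 x {a, b} (illustrative masses).
points = [(0.0, "a", 0.3), (0.0, "b", 0.1), (1.0, "a", 0.2), (1.0, "b", 0.4)]
labels = ["a", "b"]

# Weighted least squares: minimize sum_i m_i * ||onehot(g_i) - B^T (x_i, 1)||^2,
# the discrete form of the EPE functional for finitely-supported mu.
X = np.array([[x, 1.0] for x, _, _ in points])          # design rows (x, 1)
Y = np.array([[1.0 if g == l else 0.0 for l in labels]  # one-hot dummy targets
              for _, g, _ in points])
w = np.sqrt([m for _, _, m in points])[:, None]         # sqrt of the masses
B, *_ = np.linalg.lstsq(w * X, w * Y, rcond=None)       # shape (p+1, |G|)

def regression_classifier(x):
    fitted = np.array([x, 1.0]) @ B                     # one fitted value per class
    return labels[int(np.argmax(fitted))]

def bayes_classifier(x):
    mass = {(px, g): m for px, g, m in points}
    return max(labels, key=lambda g: mass.get((x, g), 0.0))

for x in (0.0, 1.0):
    print(x, bayes_classifier(x), regression_classifier(x))  # the labels agree
```

In this toy case (two support points in $x$ and two free parameters per class, so the linear model is saturated) the fitted values $\hat{\beta}(x,1)(g)$ recover the conditional probabilities $\mu(g \mid x)$ rather than the joint masses $\mu(\{(x,g)\})$, and the two classifiers agree; of course, this does not settle the general statement.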