Loss functions for regression tasks


I am trying to understand loss functions for regression tasks properly.

I have read many textbooks and articles, and I have come up with some questions on the subject.

Several different uses of loss functions can be distinguished.

(a) In prediction problems: a loss function depending on the predicted and the observed value defines the quality of a prediction.

(b) In estimation problems: a loss function depending on the true parameter and the estimated value defines the quality of estimation.

(c) Many estimators (such as least squares or M-estimators) are defined as optimizers of certain loss functions which then depend on the data and the estimated value.

Now, my focus is on loss functions for the regression task $y_i=\theta_0 +\theta_1x_{i1}+\dots + \theta_px_{ip} +\epsilon_i$, where $y$ is the dependent variable and the $x_j$'s are the independent variables.

My questions are as follows.

  1. Should I write the loss function as a function of the parameters or of the variables ($\mathrm{L}\big(\theta,\hat\theta\big)$ or $\mathrm{L}\big(y,\hat y\big)$)?

  2. Should I write the loss function for a single point or for the whole sample (i.e., with a sum or without)?

Note

My thought is to introduce the loss function first and then to use a standard notation for all the loss functions (least squares, absolute value, Huber loss, quantile loss, and so on).

UPDATED

I did the following, but I am not sure it is correct.

L2 Loss

$$ \mathrm{L}\big(\{y_i, \hat{y}_i\}_{i=1}^n\big) = \sum_{i=1}^n\big(y_i-\hat y_i\big)^2;$$

L1 Loss

$$ \mathrm{L}\big(\{y_i, \hat{y}_i\}_{i=1}^n\big) = \sum_{i=1}^n|y_i-\hat y_i|;$$

Huber Loss

$$ \mathrm{L}\big(\{y_i, \hat{y}_i\}_{i=1}^n\big) = \sum_{i=1}^n \ell_\delta\big(y_i-\hat y_i\big), \qquad \ell_\delta(r) = \begin{cases} \frac{1}{2}r^2 & |r| \leq \delta, \\ \delta|r|-\frac{1}{2}\delta^2 & \text{otherwise}; \end{cases} $$

Log-Cosh Loss

$$\mathrm{L}\big(\{y_i, \hat{y}_i\}_{i=1}^n\big)= \sum_{i=1}^n\log\big[\cosh\big(y_i-\hat y_i\big)\big];$$

Quantile Loss

$$\mathrm{L}\big(\{y_i, \hat{y}_i\}_{i=1}^n\big) = \sum_{i=1}^n \begin{cases} \tau\,(y_i-\hat y_i) & y_i \geq \hat y_i, \\ (1-\tau)\,(\hat y_i-y_i) & y_i < \hat y_i. \end{cases}$$
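As a sanity check on the formulas, here is a minimal NumPy sketch of these losses (variable names such as `y_hat` and `delta` are my own; note that the Huber and quantile losses are applied to each residual before summing):

```python
import numpy as np

def l2_loss(y, y_hat):
    # Sum of squared residuals.
    return np.sum((y - y_hat) ** 2)

def l1_loss(y, y_hat):
    # Sum of absolute residuals.
    return np.sum(np.abs(y - y_hat))

def huber_loss(y, y_hat, delta=1.0):
    # Quadratic for small residuals, linear for large ones,
    # applied pointwise to each residual before summing.
    r = y - y_hat
    small = np.abs(r) <= delta
    return np.sum(np.where(small, 0.5 * r ** 2,
                           delta * np.abs(r) - 0.5 * delta ** 2))

def log_cosh_loss(y, y_hat):
    # log(cosh(.)) of each residual, then summed.
    return np.sum(np.log(np.cosh(y - y_hat)))

def quantile_loss(y, y_hat, tau=0.5):
    # Penalizes under-prediction with weight tau
    # and over-prediction with weight 1 - tau.
    r = y - y_hat
    return np.sum(np.where(r >= 0, tau * r, (tau - 1) * r))
```

For $\tau = 0.5$ the quantile loss is half the L1 loss, which is a quick consistency check.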

Answer:

Your question seems to be mostly about getting the notation straight. Below I introduce the regression problem and the relevant notation; I think this should clarify the notation and thereby answer your questions.


Problem introduction:

In general (for linear regression) your assumption is that $y$ (dependent) and $x$ (independent) are related, up to a noise term that we drop for simplicity, by

$$ y = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p. $$ Now, $\{\theta_j\}_{j=1}^p$ are unknown, but you are given $n$ observations (a sample) $\{y^{(i)}, x^{(i)}\}_{i=1}^n$, where $x^{(i)} \in \mathbb{R}^p$ and $y^{(i)} \in \mathbb{R}$.

In order to recover the unknown parameters $\{\theta_j\}_{j=1}^p$, you want to pick estimates $\{\hat{\theta}_j\}_{j=1}^p$ so that, simultaneously for all $i=1, \ldots, n$, you recover

$$ y^{(i)} \approx \hat{y}^{(i)} := \hat{\theta}_1 x^{(i)}_1 + \hat{\theta}_2 x^{(i)}_2 + \cdots + \hat{\theta}_p x^{(i)}_p. $$


How the Loss function factors in:

How do you measure that $y^{(i)} \approx \hat{y}^{(i)}$? Well, this depends on your taste. But generally you are given a loss function $$\mathrm{Loss}\big(\{y^{(i)}, \hat{y}^{(i)}\}_{i=1}^n\big)$$ that defines for you what is meant by "$\approx$". If you prefer vector notation, set $$y = (y^{(1)}, \ldots, y^{(n)}),\quad\quad \theta:= (\theta_1, \ldots, \theta_p),$$ and let $X \in \mathbb{R}^{n\times p}$ be the matrix given by $$X_{ij} = x^{(i)}_j,$$ so that $$ \hat{y} = X \theta. $$ Then in vector form the loss is $$\mathrm{Loss}\big(\{y^{(i)}, \hat{y}^{(i)}\}_{i=1}^n\big) \equiv \mathrm{Loss}\big(y, \hat{y}\big).$$
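This vector notation can be sketched in NumPy on synthetic data (the sizes and values below are arbitrary, chosen only for illustration): stacking the samples $x^{(i)}$ as rows of $X$ makes `X @ theta` compute all $n$ predictions at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3
# n observations x^{(i)} in R^p, stacked as rows of X: X[i, j] = x^{(i)}_j.
X = rng.normal(size=(n, p))
theta = np.array([1.0, -2.0, 0.5])

# Vector form: all n predictions at once.
y_hat = X @ theta

# The same predictions, computed sample by sample.
y_hat_loop = np.array([sum(theta[j] * X[i, j] for j in range(p))
                       for i in range(n)])
assert np.allclose(y_hat, y_hat_loop)
```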


What is $\hat{\theta}$ really:

How does finding $\hat{\theta}$ work?

Now that you have your loss, you want to minimize it: $$ \hat{\theta} := \arg \min_{\theta \in \mathbb{R}^p} \mathrm{Loss}\big(\{y^{(i)} , \hat{y}^{(i)}\}_{i=1}^n\big), \qquad \mathrm{Loss}\big(\{y^{(i)} , \hat{y}^{(i)}\}_{i=1}^n\big) \underbrace{ \equiv}_{*} \mathrm{Loss}\big(y, \hat{y}\big) \underbrace{ \equiv}_{**} \mathrm{Loss}\big(\theta; \{y^{(i)}, x^{(i)}\}_{i=1}^n\big) \equiv \mathrm{Loss}\big(\theta; y, X\big), $$ where the notation on the right-hand side of $(*)$ treats the sample in vector form, and the notation on the right-hand side of $(**)$ is just a way to stress that $\{y^{(i)}, x^{(i)}\}_{i=1}^n$ are fixed (that is why they appear after the ";"), so that $\hat{y}^{(i)}, i=1, \ldots, n$, are just functions of $\theta_j, j=1, \ldots, p$. That is why minimizing over the $\theta_j$ makes sense: those are the only free parameters; everything else is fixed.

Overall, when you solve the minimization problem from above you obtain $\hat{\theta}$ as a function of the observed $\{y^{(i)}, x^{(i)}\}_{i=1}^n$. So you can write

$$ \hat{\theta} := \hat{\theta}\big(\{y^{(i)}, x^{(i)}\}_{i=1}^n\big) \equiv \hat{\theta}(y, X). $$
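As an illustration of this point, here is a plain gradient-descent sketch (on synthetic, noiseless data; the step size and iteration count are my own choices) that minimizes the squared loss over $\theta$. The resulting $\hat\theta$ is determined entirely by the fixed data $(y, X)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = rng.normal(size=(n, p))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true  # noiseless, so the minimizer should recover theta_true

# Minimize Loss(theta; y, X) = ||y - X theta||^2 by gradient descent;
# y and X are fixed, theta is the only free variable.
theta = np.zeros(p)
lr = 0.05
for _ in range(500):
    grad = -2 * X.T @ (y - X @ theta)  # gradient of the squared loss
    theta -= lr * grad / n

# hat{theta} is a function of the observed (y, X): here it recovers theta_true.
assert np.allclose(theta, theta_true, atol=1e-4)
```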


A concrete example:

Use matrix notation and take the loss to be $$\mathrm{Loss}(y, \hat{y}) := \| y - \hat{y} \|_2^2.$$ With matrix notation you have $$ \hat{y} = X \theta, $$
so you want to solve the minimization $$ \hat{\theta} := \arg \min_{\theta \in \mathbb{R}^p} \|y - X \theta\|_2^2,\tag{***} $$ where, to reiterate, $$y = (y^{(1)}, \ldots, y^{(n)}),$$ $$\theta:= (\theta_1, \ldots, \theta_p),$$ and $X \in \mathbb{R}^{n\times p}$ is the matrix given by $$X_{ij} = x^{(i)}_j.$$

In matrix form $(***)$ is just $$ \hat{\theta} := \arg \min_{\theta \in \mathbb{R}^p}\left\| \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{pmatrix} - \begin{pmatrix} x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_p \\ \vdots & \vdots & \ddots & \vdots \\ x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_p \end{pmatrix} \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_p \end{pmatrix} \right\|_2^2; $$

it has a closed-form solution (obtained by taking the derivative with respect to $\theta$ and setting it to zero, assuming $X^\top X$ is invertible) given by

$$ \hat{\theta} = (X^\top X)^{-1}X^\top y. $$

This makes explicit that $\hat{\theta}$ is a function of $y$ and $X$: $$ \hat{\theta} = \hat{\theta}(y, X) := (X^\top X)^{-1}X^\top y. $$
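A quick NumPy check of the closed-form formula on synthetic data (sizes and coefficients are arbitrary), cross-checked against `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.normal(size=n)

# Closed-form least-squares estimate: (X^T X)^{-1} X^T y.
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(theta_hat, theta_lstsq)
```

In practice one would use `np.linalg.lstsq` (or solve the normal equations with `np.linalg.solve`) rather than forming the explicit inverse, which is numerically less stable; the explicit formula is shown only to match the derivation above.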