I am studying a linear regression example for machine learning. It makes the following definition:
As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector $\mathbf{x} \in \mathbb{R}^n$ as input and predict the value of a scalar $y \in \mathbb{R}$ as its output. The output of linear regression is a linear function of the input. Let $\hat{y}$ be the value that our model predicts $y$ should take on. We define the output to be
$$\hat{y} = \mathbf{w}^T \mathbf{x}$$
where $\mathbf{w} \in \mathbb{R}^n$ is a vector of parameters.
Parameters are values that control the behaviour of the system. In this case, $w_i$ is the coefficient that we multiply by feature $x_i$ before summing up the contributions from all the features. We can think of $\mathbf{w}$ as a set of weights that determine how each feature affects the prediction. If a feature $x_i$ receives a positive weight $w_i$, then increasing the value of that feature increases the value of our prediction $\hat{y}$.
It then says the following:
It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter -- an intercept term $b$. In this model
$$\hat{y} = \mathbf{w}^T \mathbf{x} + b,$$
so the mapping from parameters to predictions is still a linear function but the mapping from features to predictions is now an affine function. This extension to affine functions means that the plot of the model's predictions still looks like a line, but it need not pass through the origin. Instead of adding the bias parameter $b$, one can continue to use the model with only weights but augment $\mathbf{x}$ with an extra entry that is always set to $1$. The weight corresponding to the extra $1$ entry plays the role of the bias parameter.
This is the first part that I have a question about:
so the mapping from parameters to predictions is still a linear function but the mapping from features to predictions is now an affine function.
Can someone please clarify this?
This is the second part I have a question about:
Instead of adding the bias parameter $b$, one can continue to use the model with only weights but augment $\mathbf{x}$ with an extra entry that is always set to $1$.
So the vector $\mathbf{x}$ would just have one additional element (a $1$ value) at the end? And this means we can avoid the bias parameter and just have $\hat{y} = \mathbf{w}^T \mathbf{x}$?
Thank you.
Note that the parameters are $(w,b)$, hence we have
$$\hat{y}=(w^T, b)\begin{bmatrix} x \\ 1\end{bmatrix}$$
which is linear with respect to the parameters.
However, as a function of the original features $x$ alone,
$$\hat{y}=w^Tx+b$$
involves a translation by $b$ away from the origin. Hence the map from features to predictions is affine, not linear.
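To make this concrete, here is a short NumPy check (with made-up example values): the prediction is linear in the stacked parameters $(w, b)$, so doubling them doubles the output, but the map $x \mapsto w^Tx + b$ fails the linearity test $f(2x) = 2f(x)$ whenever $b \neq 0$.

```python
import numpy as np

# Made-up example values.
w = np.array([2.0, -1.0, 0.5])
b = 3.0
x = np.array([1.0, 4.0, 2.0])

def predict(w, b, x):
    return w @ x + b

# Linear in the parameters: doubling (w, b) doubles the prediction.
assert np.isclose(predict(2 * w, 2 * b, x), 2 * predict(w, b, x))

# Affine, not linear, in the features: f(2x) != 2 f(x) when b != 0.
print(predict(w, b, 2 * x), "vs", 2 * predict(w, b, x))  # 1.0 vs 4.0
```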
For your second question, yes: you can either append the $1$ as the last entry of $\mathbf{x}$ or prepend it as the first entry, as long as you are consistent.
$$\hat{y}=(w^T, b)\begin{bmatrix} x \\ 1\end{bmatrix}= (b, w^T)\begin{bmatrix}1\\ x \end{bmatrix}$$
and reduce the analysis to the earlier case.
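A quick NumPy sanity check of this identity (again with made-up numbers): folding $b$ into the weight vector and appending a $1$ to the features gives the same prediction.

```python
import numpy as np

# Made-up example values.
w = np.array([2.0, -1.0, 0.5])
b = 3.0
x = np.array([1.0, 4.0, 2.0])

# Append a 1 to x and fold the bias b into the weight vector.
x_aug = np.append(x, 1.0)   # [x, 1]
w_aug = np.append(w, b)     # [w, b]

# w^T x + b == w_aug^T x_aug
assert np.isclose(w @ x + b, w_aug @ x_aug)
```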