A question about dual representations, kernels and notations used in Bishop's book.

I'm having a hard time with kernel functions and dual representations in *Pattern Recognition and Machine Learning* by Bishop. The page I'm trying to understand is the one in Section 6.1 where the dual representation of regularized least squares is derived:

[screenshot of the page from Bishop, Section 6.1 "Dual Representations"]

Of course, setting the gradient of $J(\textbf{w})$ to $\textbf{0}$ and solving for $\textbf{w}$ leads to (6.3).
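For readers without the book at hand, here are the relevant formulas as I transcribe them from the page. The error function (his (6.2)) is the regularized sum-of-squares

$$J(\textbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(\textbf{w}^T\phi(\textbf{x}_n) - t_n\right)^2 + \frac{\lambda}{2}\textbf{w}^T\textbf{w},$$

and setting $\nabla_{\textbf{w}} J(\textbf{w}) = \textbf{0}$ gives

$$\textbf{w} = -\frac{1}{\lambda}\sum_{n=1}^{N}\left(\textbf{w}^T\phi(\textbf{x}_n) - t_n\right)\phi(\textbf{x}_n) = \Phi^T\textbf{a}, \qquad a_n = -\frac{1}{\lambda}\left(\textbf{w}^T\phi(\textbf{x}_n) - t_n\right),$$

which is (6.3), with the $a_n$ defined in (6.4). He then introduces the design matrix $\Phi$.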

For me (without initially considering nonlinear transformations of the dataset), the design matrix $\textbf{X}$ is an $N \times D$ matrix, where $N$ is the number of samples and $D$ the number of components/features:

$$\textbf{X} = \begin{bmatrix}x_1^{(1)} & \dots & x_1^{(D)}\\ \vdots & & \vdots \\ x_N^{(1)} & \dots & x_N^{(D)} \end{bmatrix}$$

Consider a (possibly nonlinear) transformation $\phi: \, \mathbb{R}^D \rightarrow \mathbb{R}^K$ such that $$ \mathbb{R}^D \ni \textbf{x}=(x^{(1)}, \dots , x^{(D)}) \mapsto \phi(\textbf{x}) = \begin{bmatrix}\phi_1(\textbf{x}) & \dots & \phi_K(\textbf{x}) \end{bmatrix}.$$

I'm assuming the map $\phi$ takes a row vector and returns a row vector, because I originally assumed that the $D$-dimensional feature vectors of the matrix $\textbf{X}$ were rows. Is this correct? If yes, why? This reasoning would lead me to write the design matrix $\Phi$ as the $N \times K$ matrix

$$\Phi = \begin{bmatrix}\phi_1(\textbf{x}_1) & \dots & \phi_K(\textbf{x}_1)\\ \vdots & & \vdots \\ \phi_1(\textbf{x}_N) & \dots & \phi_K(\textbf{x}_N) \end{bmatrix}$$
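To make my mental picture concrete, here is a minimal NumPy sketch of how I imagine $\Phi$ being assembled row by row (the particular feature map `phi` below is just an illustrative assumption, not the one from the book):

```python
import numpy as np

# Toy data: N = 4 samples, D = 2 features, one sample per ROW of X.
X = np.array([[0.0, 1.0],
              [1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0]])          # shape (N, D) = (4, 2)

def phi(x):
    """Illustrative feature map R^2 -> R^3: phi(x) = (x1, x2, x1*x2).
    Takes one sample and returns its K = 3 feature values."""
    return np.array([x[0], x[1], x[0] * x[1]])

# Design matrix: the n-th row of Phi is phi(x_n).
Phi = np.stack([phi(x) for x in X])  # shape (N, K) = (4, 3)

# Gram matrix K = Phi Phi^T, with K[n, m] = phi(x_n)^T phi(x_m),
# which is what the kernel/dual formulation is built on.
K = Phi @ Phi.T                      # shape (N, N) = (4, 4)
```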

So why does he write that the $n$-th row corresponds to $\phi(\textbf{x}_n)^T$?

Also, how the hell does he obtain (6.5) by substitution? I really can't figure it out.
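For reference, since the formula sits inside the image: what I read as (6.5), after substituting $\textbf{w} = \Phi^T\textbf{a}$ back into $J(\textbf{w})$, is

$$J(\textbf{a}) = \frac{1}{2}\textbf{a}^T\Phi\Phi^T\Phi\Phi^T\textbf{a} - \textbf{a}^T\Phi\Phi^T\textbf{t} + \frac{1}{2}\textbf{t}^T\textbf{t} + \frac{\lambda}{2}\textbf{a}^T\Phi\Phi^T\textbf{a},$$

where $\textbf{t} = (t_1, \dots, t_N)^T$. Thank you for your help.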