Multiclass classification by hand - how to use gradient descent?


I am learning logistic regression. Having learnt something about binary classification, I came across this article on multiclass classification:

Given a set of $9$ training data with $2$-dimensional inputs and their corresponding class labels as follows. $$ \begin{array} {|r|r|}\hline \text{dim} \backslash \text{input} & X_1 & X_2 & X_3 & X_4 & X_5 & X_6 & X_7 & X_8 & X_9 \\ \hline 1 & 1.35 & 0.82 & 0.06 & 3.63 & 2.64 & 4.78 & -0.68 & 0.41 & -0.60 \\ \hline 2 & 0.66 & 2.51 & 1.06 & 5.01 & 7.2 & 5.41 & 5.64 & 4.92 & 5.54 \\ \hline \text{label} & 1 & 1 & 1 & 2 & 2 & 2 & 3 & 3 & 3 \\ \hline \end{array} $$ .... Fit a Multi-Class Logistic Regression model to the training data using the algorithm of Gradient Descent? Provided that the learning rate is set to be $0.05$, the number of training epochs is set to be $1$ and the initial model parameters are set as follows. $$ \begin{array} {|r|r|}\hline \text{class} \backslash \text{dim} & 0 & 1 & 2 \\ \hline 1 & 4.06 & -5.76 & 4.13 \\ \hline 2 & 1.11 & -8.45 & 1.16 \\ \hline 3 & -6.31 & 8.28 & -3.73 \\ \hline \end{array} $$

I can't understand how the second table follows from the first using gradient descent. The so-called "log-likelihood function" (for the binary case) is $$J(\vec{w},b)=\ln L_{\vec{w},b}(\vec{x})=\sum_{i=1}^{N}\left[y_i\ln f(\vec{x}_i)+(1-y_i)\ln\left(1-f(\vec{x}_i)\right)\right],$$ the sum of the $N$ "cost functions" $C_i$. Here, $$f(\vec{x})= \frac{1}{1+\exp\left(-(\vec{w}\cdot \vec{x}+b)\right)}. $$ Making the substitution $\vec{w}\cdot \vec{x}+b \to z$ and $f \to h_\theta$, this becomes $$ h_\theta (z) = \frac {1}{1+e^{-z}}. $$ According to this site, the gradients of the cost functions are $\partial_{\vec{w}} C = \vec{x}(h_\theta - y)$ and $\partial_b C = h_\theta - y$, and here we are told that the formula for batch gradient descent is $$ \theta_j = \theta_j - \alpha\, \partial_{\theta_j} J(\theta).$$ So I think that all these $\theta$'s are actually the components of $\vec{w}$ together with $b$, while the author takes $\alpha = 0.05$. But how does all this lead from the first table to the second?
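To make the binary update rule concrete, here is a minimal pure-Python sketch of one batch gradient-descent step, using exactly the gradients quoted above, $\partial_{\vec{w}} C = \vec{x}(h_\theta - y)$ and $\partial_b C = h_\theta - y$. The two data points at the bottom are made up for illustration and are not taken from the table:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_gd_step(w, b, X, y, alpha):
    """One batch gradient-descent step for binary logistic regression.

    Accumulates dC/dw = x * (h - y) and dC/db = (h - y) over all points,
    then applies theta <- theta - alpha * grad.
    """
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, t in zip(X, y):
        h = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        for j, xj in enumerate(x):
            grad_w[j] += xj * (h - t)
        grad_b += h - t
    w = [wj - alpha * g for wj, g in zip(w, grad_w)]
    b = b - alpha * grad_b
    return w, b

# Toy usage: two 2-d points, labels 1 and 0, starting from zero weights.
w, b = batch_gd_step([0.0, 0.0], 0.0, [[1.0, 2.0], [2.0, 1.0]], [1, 0], 0.05)
```

One epoch of batch gradient descent is just one call of this step over the whole dataset.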


It is just the initial weight matrix $W_0$; it doesn't come from the first table.

During the gradient descent process, the weights are then updated iteratively using the gradient descent formula.

The goal is to learn a set of parameters that fits the data well.
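To see how the one-epoch update would actually run from that initial table, here is a sketch in Python. It assumes the common softmax (multinomial) formulation with the bias folded in as dimension $0$; the article may instead use a different multiclass scheme (e.g. one-vs-rest), so treat this as illustrative rather than as the article's exact method:

```python
import math

# Training data from the first table (columns X1..X9), 1-based class labels.
X = [[1.35, 0.66], [0.82, 2.51], [0.06, 1.06],
     [3.63, 5.01], [2.64, 7.2],  [4.78, 5.41],
     [-0.68, 5.64], [0.41, 4.92], [-0.60, 5.54]]
labels = [1, 1, 1, 2, 2, 2, 3, 3, 3]

# Initial parameters from the second table: W[k] = [bias, w1, w2] for class k+1.
W = [[4.06, -5.76, 4.13],
     [1.11, -8.45, 1.16],
     [-6.31, 8.28, -3.73]]
alpha = 0.05

def softmax(zs):
    m = max(zs)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

# One epoch of batch gradient descent on the softmax cross-entropy loss:
# the gradient for class k is sum_i (p_ik - [y_i == k]) * [1, x_i1, x_i2].
grads = [[0.0] * 3 for _ in range(3)]
for x, y in zip(X, labels):
    feats = [1.0, x[0], x[1]]
    zs = [sum(wk * f for wk, f in zip(W[k], feats)) for k in range(3)]
    p = softmax(zs)
    for k in range(3):
        err = p[k] - (1.0 if y == k + 1 else 0.0)
        for j in range(3):
            grads[k][j] += err * feats[j]

W = [[W[k][j] - alpha * grads[k][j] for j in range(3)] for k in range(3)]
```

With the epoch count set to $1$, `W` after this single pass is the learned parameter table; more epochs would simply repeat the gradient accumulation and update.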