Perceptron and gradient descent


Imagine the simplest case: a set of points with labels in $\{1,-1\}$ such that the two groups of points (according to their labels) are perfectly separated by a hyperplane of the form $\sum_i w_ix_i-\theta=0$.

In this case, some books describe a naive method based on a "batch gradient descent" algorithm (without any stochastic aspect, since the two groups of points are well separated).
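To make the setting concrete, here is a minimal sketch of such a batch update on a toy dataset. The data, the learning rate `eta`, and the trick of folding the threshold $\theta$ into $w$ via an appended constant feature are my own illustrative choices, not taken from any particular book:

```python
import numpy as np

# Toy linearly separable data with labels in {+1, -1}.
# The threshold theta is folded into w by appending a constant 1
# to each point (an illustrative convention, not from the question).
X = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1, 1, -1, -1])
Xb = np.hstack([X, np.ones((len(X), 1))])  # bias column appended

w = np.zeros(Xb.shape[1])
eta = 0.1  # learning rate (arbitrary choice for the sketch)

for _ in range(100):
    scores = Xb @ w
    misclassified = y * scores <= 0  # points on the wrong side (or on the boundary)
    if not misclassified.any():
        break  # perfect separation: every point classified correctly
    # Batch step: accumulate the correction over ALL misclassified points
    # before updating, rather than one point at a time.
    w += eta * (y[misclassified][:, None] * Xb[misclassified]).sum(axis=0)

# On separable data this loop terminates with zero misclassifications.
assert np.all(y * (Xb @ w) > 0)
```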

1) I am very curious about the function we want to minimize in this case. For example, does the first algorithm described here: http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html derive from gradient descent? If yes, what is the convex cost function (perhaps there are many) to which gradient descent is applied? I would be very pleased if you could explain the mathematical details.

2) Referring to https://stats.stackexchange.com/questions/138229/from-the-perceptron-rule-to-gradient-descent-how-are-perceptrons-with-a-sigmoid, I saw plenty of cases where the function to be minimized is not differentiable (and we know that gradient descent requires at least differentiability).

In this example, we can see that $J(w)$ involves $\hat{y} = \operatorname{sign}(\mathbf{w}^T\mathbf{x}_i)$, and the function $\operatorname{sign}$ is not differentiable at $0$. If we simply drop the $\operatorname{sign}$ function, the problem loses its meaning.

In this context, how can we apply gradient descent?

Edit: Question 2) is now resolved thanks to your help. Could you clarify 1) based on the link?

Thanks a lot for your amazing help!

1 Answer
Hint: check subdifferentiability in the context of convex analysis.
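To illustrate the hint: the perceptron criterion $J(w)=\sum_i \max(0,\,-y_i\,\mathbf{w}^T\mathbf{x}_i)$ is convex but not differentiable wherever some $y_i\,\mathbf{w}^T\mathbf{x}_i=0$; at those kinks one can still pick a valid subgradient, and the classic perceptron update is exactly a subgradient step. The sketch below is my own illustration of this idea (the data, function names, and step size are assumptions, not part of the answer):

```python
import numpy as np

def perceptron_loss(w, X, y):
    """Perceptron criterion: sum_i max(0, -y_i * w.x_i). Convex, piecewise linear."""
    return np.maximum(0.0, -y * (X @ w)).sum()

def subgradient(w, X, y):
    """A valid subgradient of the perceptron criterion.

    Where y_i * w.x_i < 0 the loss is differentiable with gradient -y_i x_i;
    at the kink y_i * w.x_i = 0 we may choose -y_i x_i from the subdifferential.
    """
    active = y * (X @ w) <= 0
    return -(y[active][:, None] * X[active]).sum(axis=0)

# Toy separable data, bias folded into the last coordinate (illustrative).
X = np.array([[2.0, 2.0, 1.0], [1.5, 1.0, 1.0],
              [0.0, 0.0, 1.0], [-1.0, 0.5, 1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(3)
for _ in range(100):
    g = subgradient(w, X, y)
    if not np.any(g):
        break
    w -= 0.1 * g  # subgradient descent step; reduces to the perceptron update

# On separable data the loss reaches its minimum value 0.
assert perceptron_loss(w, X, y) == 0.0
```

Note that the subgradient step `w -= 0.1 * g` is term-by-term the same as the batch perceptron update $w \leftarrow w + \eta \sum_{i\ \text{misclassified}} y_i\mathbf{x}_i$, which is why "perceptron learning" can legitimately be read as (sub)gradient descent on a convex cost despite the non-differentiability.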