Intuition behind the Logistic Regression Cost and Update Functions


I am lacking intuition about the logistic regression cost and update functions. For example, in the cost function

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \big[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\big]$$

where $h_\theta$ is the sigmoid function

$$h_\theta(x) = \frac{1}{1+e^{-\theta^T x}},$$

why is $\log$ used? Is it just to make computations easier? Could the $\log$ be dropped and it still work the same? Since likelihood is the inverse of probability, couldn't the inverse of the sigmoid function be used instead?

Also, is there any reason, other than coincidence, that the derivative of the cost for both logistic and linear regression is the error times $x^{(i)}$?

There are 2 answers below.

BEST ANSWER

We are given a data vector $\textbf{x}$ and a class vector $\textbf{y}$. The class vector tells us which of two classes $\{0,1\}$ the data instances belong to.

We want to come up with a function $h_\theta(x_i)$ that helps us estimate the classes $y_i$ as best as we can.

You can think of $h_\theta(x_i)$ as the probability that $y_i=1$, given $x_i$ and $\theta$.

$$P(y_i=1|x_i,\theta)=h_\theta(x_i)$$

Likewise, $1-h_\theta(x_i)$ is the probability that $y_i=0$, given $x_i$ and $\theta$.

$$P(y_i=0|x_i,\theta)=1-h_\theta(x_i)$$

We can combine the two formulas in a clever way using exponents:

$$P(y_i|x_i,\theta)=h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}$$

(Note that one of the terms is always reduced to 1 because one of the exponents is always zero.)
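The exponent trick is easy to see concretely. Here is a minimal Python sketch (the function names are my own) showing that the combined formula reduces to $h$ when $y=1$ and to $1-h$ when $y=0$:

```python
import math

def sigmoid(z):
    # logistic function: maps any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_prob(h, y):
    # combined form: h^y * (1 - h)^(1 - y)
    return h ** y * (1 - h) ** (1 - y)

h = sigmoid(0.5)  # some predicted probability for a single instance
assert bernoulli_prob(h, 1) == h      # y = 1: the (1 - h) factor becomes 1
assert bernoulli_prob(h, 0) == 1 - h  # y = 0: the h factor becomes 1
```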

Since all instances are independent, the total probability over all instances $i$ is just the product of all the individual probabilities:

$$P(\textbf{y}|\textbf{x},\theta)=\prod_i h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}$$


We are hoping to maximize the probability of the output vector $\textbf{y}$, or equivalently, to maximize its $\log$.

$$\log\big( P(\textbf{y}|\textbf{x},\theta)\big)=\log\big(\prod_i h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}\big)$$

$$=\sum_i \log\big(h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}\big)$$

$$=\sum_i \big[y_i\log(h_\theta(x_i)) + (1-y_i)\log(1-h_\theta(x_i))\big]$$

This is a function with a maximum, but since we want to use gradient descent, we can just throw a negative sign in front to turn the maximum into a minimum, and scale it by the number of instances $m$ for convenience. (This makes the cost more or less invariant to the number of instances.)

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \big[y_i\log(h_\theta(x_i)) + (1-y_i)\log(1-h_\theta(x_i))\big]$$
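As a sanity check, $J(\theta)$ translates almost line for line into NumPy (a sketch; the variable names and array shapes are my own assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta): negative average log-likelihood over m instances,
    # with X of shape (m, n), y of shape (m,), theta of shape (n,)
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# with theta = 0, every h_theta(x_i) is 0.5 and the cost is log(2)
X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 0.0])
print(cost(np.zeros(2), X, y))  # ~0.6931
```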

Why did we bother taking the $\log$? Because it's easier to take the derivative of a sum rather than the derivative of a product (imagine all that product rule!). You'll find this trick is used a lot in machine learning to make differentiation easier.
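This also touches the second part of the question: differentiating the sum above gives the gradient $\frac{1}{m}\sum_i (h_\theta(x_i)-y_i)\,x_i$, the same "error times $x_i$" form as in linear regression. A small NumPy sketch (my own variable names) checks this analytic gradient against finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # negative average log-likelihood J(theta)
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

def grad(theta, X, y):
    # analytic gradient: X^T (h - y) / m -- same form as in linear regression
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m

# numerical check against central finite differences on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)
eps = 1e-6
num = np.array([(cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
                for e in np.eye(3)])
assert np.allclose(num, grad(theta, X, y), atol=1e-6)
```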


You also asked why we chose $h_\theta(x_i)$ to be the sigmoid function. A couple reasons are:

  1. It is differentiable (unlike the unit step function), so we can use gradient descent with it.
  2. Its domain is the whole real line $\mathbb{R}$ and its range is $(0,1)$, which seems like a good fit for a binary classification problem with no constraints on the input value.
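Both properties are easy to check numerically. The sketch below also uses the handy closed form for the derivative, $\sigma'(z)=\sigma(z)(1-\sigma(z))$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 1001)
s = sigmoid(z)
assert np.all((s > 0) & (s < 1))  # outputs stay strictly inside (0, 1)

# the derivative has the closed form sigma(z) * (1 - sigma(z))
eps = 1e-6
num_deriv = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert np.allclose(num_deriv, s * (1 - s), atol=1e-6)
```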

To me it seems like the use of sigmoid is more of an engineered solution rather than something we arrived at from a mathematical proof. It has nice properties and seems to work.

In neural networks, some people prefer alternatives to the sigmoid such as $\arctan$ and $\tanh$, but I don't think it makes much of a difference in most cases.


SECOND ANSWER

First, think about the objective of logistic regression. We want to design a system that takes as input $\mathbf{x}$ and spits out a binary decision: a $0$ or $1$ label.

For simplicity, we assume that this system can be described just by a vector parameter $\theta$ and determines the output according to $h_{\theta}(\mathbf{x})$. The choice of this function is arbitrary -- no one says that it is necessarily true or the best choice for the problem at hand every time. It is a common choice because it is simple and yields output in the desired range $[0, 1]$.

Now, we want to find a good choice for the parameter $\theta$. To do that, we have some training samples (labeled samples), and we design a cost function that, for each value of $\theta$, describes how well the system performs: the better the performance, the lower the cost.

Intuitively, this cost function should be such that if a training sample is labeled $1$ (i.e., $y^{(i)}=1$), then it should penalize the system when the latter outputs values close to $0$. Even better, the closer the output is to zero, the more it should be penalized. On the contrary, when the system outputs $1$, it should ideally incur zero cost. Hence $-\log(h_{\theta}(\mathbf{x}))$ sounds like a good choice. Plus, it may have some additional benefits when it comes to optimizing the cost function. (By symmetry, $-\log(1-h_{\theta}(\mathbf{x}))$ plays the same role for $0$-labeled samples.)
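The shape of this penalty is easy to verify (a tiny sketch; `penalty_if_label_1` is my own name for the per-sample cost):

```python
import math

def penalty_if_label_1(h):
    # cost contribution of a sample whose true label is 1,
    # as a function of the system's output h in (0, 1]
    return -math.log(h)

assert penalty_if_label_1(1.0) == 0.0                      # confident and correct: no cost
assert penalty_if_label_1(0.01) > penalty_if_label_1(0.5)  # closer to 0: penalized more
```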