Logistic regression: Non-convergence issue


I'm trying to perform a simple binary logistic regression over the MNIST dataset using Python. The problem I'm experiencing is that when I try to optimize the $\vec{\theta}$ model parameters using scipy.optimize.minimize function (Newton-CG method) it throws the following warning:

  Warning: Desired error not necessarily achieved due to precision loss

and the optimization result is very poor; I don't know why.

The dataset is as below:

  1. 60000 28x28 training images of handwritten numbers (read as a $(m, n)$ matrix ${X}$, where $m$ is the number of images and $n$ is the number of pixels)
  2. 60000 training labels (read as a $(m, 1)$ vector $\vec{y}$, where $m$ is the number of images)
  3. 10000 28x28 test images of handwritten numbers (read as a $(p, n)$ matrix ${X_t}$, where $p$ is the number of (test) images and $n$ is the number of pixels)
  4. 10000 test labels (read as a $(p, 1)$ vector $\vec{y_t}$, where $p$ is the number of (test) images)

I perform the following preprocessing steps over this data:

  1. Standardization over each feature:

    $$z_j = \frac{\vec{x_j} - \mu_j}{\sigma_j}$$

    Here $\vec{x_j}$ is the $j$-th feature vector (the $j$-th column of ${X}$), $\mu_j$ is the mean of $\vec{x_j}$ and $\sigma_j$ is its standard deviation.

  2. Select only binary digits (0,1)

  3. Add an intercept term (a first column filled with ones)
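These three steps can be sketched as follows (a minimal sketch assuming `X` and `y` are already loaded as NumPy arrays; the guard for zero-variance features is my own addition, since constant border pixels in MNIST have $\sigma_j = 0$ and would otherwise cause a division by zero):

```python
import numpy as np

def preprocess(X, y):
    """Standardize each feature, keep only digits 0/1, add an intercept column."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0                       # guard: constant pixels have sigma = 0
    Z = (X - mu) / sigma                          # step 1: standardization
    mask = (y.ravel() == 0) | (y.ravel() == 1)    # step 2: binary digits only
    Z, yb = Z[mask], y.ravel()[mask]
    Z = np.hstack([np.ones((Z.shape[0], 1)), Z])  # step 3: intercept term
    return Z, yb.reshape(-1, 1)
```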

With the data ready to be used in a training algorithm I've coded it as a class:

class LogisticRegression(object):
""" Logistic regression class

Implementation of common methods used in logistic regression problems

.. data:: data

   Dataset (train and test) as a Data class

.. data:: cost

   Logistic cost

.. data:: jac

   Jacobian of the cost function (logistic)

"""
def __init__(self, data):
    """ Constructor

    :param data: Input data
    :type data: Data class

    """
    self.data = data
    self.cost = None
    self.jac = None
    self.theta = None
    self.init_params()

def sigmoid(self, x):
    """ Compute the sigmoid function

    Compute the sigmoid function:

    .. math::

        \\sigma(x) = \\frac{1}{1 + e^{-x}}

    .. note::

        This version of sigmoid tries to avoid overflows in the computation
        of :math:`e^{-x}` by using an alternative formulation when x is 
        negative:

        .. math::

            \\sigma(x) = \\frac{e^{x}}{1 + e^{x}}

        This is equivalent to the definition of sigmoid, but we won't 
        get :math:`e^{-x}` to overflow when x is very negative.

        Since both the x and y arguments to np.where are evaluated by Python,
        we may still get overflow warnings for large x elements; therefore
        we ignore warnings during this computation.


    :param x: Input
    :type x: Scalar or numpy ndarray
    :returns: Sigmoid function result
    :rtype: Scalar or numpy ndarray

    """
    with np.errstate(over='ignore', invalid='ignore'):
        return np.where(x >= 0,
                        1 / (1 + np.exp(-x)),
                        np.exp(x) / (1 + np.exp(x)))

def hypothesis(self, theta, x):
    """Makes classification predictions for the data in x using theta.

    :param theta: Logistic regression parameters (n, 1)
    :param x: Feature matrix (m, n)
    :type theta: Numpy ndarray
    :type x: Numpy ndarray
    :returns: Logistic regression prediction (:math:`\\hat{y}`) with shape (m, 1), a
              number in the range (0.0, 1.0) for each sample. The number is the 
              probability that the sample is classified as 1.

    :rtype: Numpy ndarray with shape (m, 1)
    """
    z = x.dot(theta)
    return self.sigmoid(z)

def log_cost(self, theta, x, y, inplace=True):
    """ Logistic cost function (cross-entropy error function)

    Compute the cross-entropy error function for a logistic regression
    problem:

    .. math::

        J(\\theta) = -\\frac{1}{m}\\sum_i^m\\left(y_i log(h_{\\theta}(x_i)) + (1-y_i)\
            log(1-h_{\\theta}(x_i))\\right)

    where

    .. math::

        h_{\\theta}(x_i) = sigmoid(\\theta^tx)

    .. note::

        Calls to hypothesis may produce probabilities that are
        0.0 and 1.0, due to the exponent in sigmoid and the large numbers
        involved. np.log on 0.0 overflows, so we clip the input of np.log to
        values very close to 0 instead.

    :param theta: Parameters vector
    :param x: Feature matrix (m, n)
    :param y: Labels vector (m, 1)
    :param inplace: If True, assigns the result to the internal attribute ``cost``
    :type theta: Numpy ndarray (shape (n, 1))
    :type x: Numpy ndarray (m, n)
    :type y: Numpy ndarray (m, 1)
    :type inplace: Boolean
    :returns: Cost
    :rtype: Scalar
    """
    m, n = x.shape
    theta = theta.reshape(n, 1)
    eps = np.finfo(np.float32).eps
    yhat_prob = np.clip(self.hypothesis(theta, x),
                        a_min=eps,
                        a_max=1.0-eps)
    loss = np.mean(np.where(y == 1,
                            -np.log(yhat_prob),
                            -np.log(1 - yhat_prob)))
    if inplace:
        self.cost = loss.flat[0]
    return loss.flat[0]

def log_cost_jac(self, theta, x, y, inplace=True):
    """ Logistic cost function Jacobian computation

    Computes the Jacobian of the logistic cost function:

        .. math::

            \\nabla J(\\theta) = \\frac{1}{m}\\sum_i x^i\\left(h_{\\theta}(x^i) - y^i\\right)

    :param theta: Parameters vector 1D
    :param x: Feature matrix (m, n)
    :param y: Labels vector (m, 1)
    :param inplace: If True, assigns the result to the internal attribute ``jac``
    :type theta: Numpy ndarray (shape (n, 1))
    :type x: Numpy ndarray (m, n)
    :type y: Numpy ndarray (m, 1)
    :type inplace: Boolean
    :returns: Logistic cost function Jacobian
    :rtype: Numpy ndarray (n,) where n is the number of parameters (theta) (scipy minimize needs this
            shape)

    """
    m, n = x.shape
    theta = theta.reshape(n, 1)
    eps = np.finfo(np.float32).eps
    yhat_prob = np.clip(self.hypothesis(theta, x),
                        a_min=eps,
                        a_max=1.0-eps)
    yh = np.where(y.reshape(len(y), 1) == 1, yhat_prob - 1, yhat_prob)
    dtheta = np.dot(yh.T, x).T / float(m)
    if inplace:
        self.jac = dtheta
    return dtheta.flatten()

def log_cost_opt(self, method='Newton-CG', jac=True):
    """ Optimization of the logistic cost function over the theta parameters.

    """
    if jac:
        result = optimize.minimize(fun=self.log_cost,
                                   x0=self.theta.flatten(),
                                   method=method,
                                   args=(self.data.trainX[:20, :], self.data.trainY[:20], True),
                                   jac=self.log_cost_jac)
    else:
        result = optimize.minimize(fun=self.log_cost,
                                   x0=self.theta.flatten(),
                                   method=method,
                                   args=(self.data.trainX[:20, :], self.data.trainY[:20], True))
    return result



def init_params(self):
    """ Init parameter vector
    """
    m, n = self.data.trainX.shape
    self.theta = np.random.rand(n, 1)
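Independently of the class above, the analytic Jacobian can be sanity-checked against a finite-difference approximation with `scipy.optimize.check_grad`; a wrong gradient is a common cause of the precision-loss warning. A self-contained sketch on synthetic data (the `sigmoid`, `cost` and `jac` functions below re-implement the same formulas standalone, and the small random problem stands in for the MNIST subset):

```python
import numpy as np
from scipy import optimize

def sigmoid(x):
    # numerically stable sigmoid, computed branch by branch
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def cost(theta, X, y):
    # mean cross-entropy loss, probabilities clipped away from 0 and 1
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def jac(theta, X, y):
    # analytic gradient: (1/m) X^T (h - y)
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = (rng.standard_normal(50) > 0).astype(float)
theta0 = 0.1 * rng.standard_normal(4)
err = optimize.check_grad(cost, jac, theta0, X, y)
```

If `err` is not small (say, below `1e-4` for a problem of this size), the cost and gradient disagree and Newton-CG will struggle to converge.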

The cost function is

$$J(\theta) = -\frac{1}{m}\sum_i^m\left(\vec{y_i} log(h_{\theta}(\vec{x_i})) + (1-\vec{y_i}) log(1-h_{\theta}(\vec{x_i}))\right)$$

being $$h_{\theta}(\vec{x_i}) = \frac{e^{\vec{w}^t \vec{x_i}}}{1+e^{\vec{w}^t \vec{x_i}}}$$

and the Jacobian and Hessian are:

$$\nabla J(\vec{\theta}) = \frac{1}{m}\sum_i \vec{x_i}\left(h_{\theta}(\vec{x_i}) - y_i\right) = \frac{1}{m}X^t(\vec{h} - \vec{y}) $$

$$ Hess(J(\vec{\theta})) = X^tSX$$

where $S = \mathrm{diag}\left(h_i(1 - h_i)\right)$, with $h_i = h_{\theta}(\vec{x_i})$.
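Since the entries $h_i(1-h_i)$ of $S$ are nonnegative, the Hessian $X^tSX$ is positive semi-definite in exact arithmetic; a numerical test can still report "not positive definite" when some eigenvalues are (near) zero, e.g. because constant border pixels make columns of $X$ linearly dependent. A standalone sketch for forming the Hessian and inspecting its eigenvalues (`logistic_hessian` is my own helper name, not part of the class above):

```python
import numpy as np

def logistic_hessian(theta, X):
    # Hessian of the mean logistic cost: (1/m) X^T S X with S = diag(h * (1 - h))
    h = 1.0 / (1.0 + np.exp(-np.clip(X @ theta, -500, 500)))
    s = h * (1.0 - h)                     # diagonal of S
    return (X * s[:, None]).T @ X / len(X)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
H = logistic_hessian(rng.standard_normal(5), X)
eigs = np.linalg.eigvalsh(H)              # all >= 0 up to roundoff for full-rank X
```

On standardized MNIST data, eigenvalues at (or numerically below) zero would point to singularity from redundant pixels rather than genuine negative curvature.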

So I run the optimization, but it throws the warning quoted above. It should converge because the Hessian is positive definite; however, if I compute the Hessian at the initial (random) parameters, it is not positive definite. On the other hand, if I optimize the parameters with the Powell method, the result is also poor, although no error or warning is raised.

Could you please help me with this issue?

Thanks in advance,