I'm trying to perform a simple binary logistic regression over the MNIST dataset using Python. The problem I'm experiencing is that when I try to optimize the $\vec{\theta}$ model parameters using the scipy.optimize.minimize function (Newton-CG method), it throws the following warning:
Warning: Desired error not necessarily achieved due to precision loss
and the optimization result is very poor, and I don't know why.
The dataset is as below:
- 60000 28x28 training images of handwritten numbers (read as a $(m, n)$ matrix ${X}$, where $m$ is the number of images and $n$ is the number of pixels)
- 60000 training labels (read as a $(m, 1)$ vector $\vec{y}$, where $m$ is the number of images)
- 10000 28x28 test images of handwritten numbers (read as a $(p, n)$ matrix ${X_t}$, where $p$ is the number of (test) images and $n$ is the number of pixels)
- 10000 test labels (read as a $(p, 1)$ vector $\vec{y_t}$, where $p$ is the number of (test) images)
I preprocess this data with the following steps:
Standardization over each feature:
$$z_j = \frac{\vec{x_j} - \mu_j}{\sigma_j}$$
Here $x_j$ is the j-th feature vector (the j-th column of ${X}$), $\mu_j$ is the mean value of $\vec{x_j}$ and $\sigma_j$ is the standard deviation of $\vec{x_j}$.
Select only the binary digits (0 and 1)
Add an intercept term (a first column filled with ones)
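A minimal sketch of these preprocessing steps (the `preprocess` helper and the toy shapes below are just illustrative, not my actual loading code):

```python
import numpy as np

def preprocess(X, y):
    """Standardize features, keep only digits 0/1, and add an intercept column."""
    # Standardize each feature (guard against zero-variance pixels)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0
    Z = (X - mu) / sigma
    # Keep only the binary digits 0 and 1
    mask = (y.ravel() == 0) | (y.ravel() == 1)
    Z, y = Z[mask], y.ravel()[mask].reshape(-1, 1)
    # Prepend a column of ones as the intercept term
    Z = np.hstack([np.ones((Z.shape[0], 1)), Z])
    return Z, y

# Tiny example with fake data shaped like flattened 28x28 images
X = np.random.rand(10, 784)
y = np.array([[0], [1], [2], [0], [1], [9], [3], [1], [0], [5]])
Xb, yb = preprocess(X, y)
```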
With the data ready to be used in a training algorithm, I've coded it as a class:
import numpy as np
from scipy import optimize


class LogisticRegression(object):
    """ Logistic regression class

    Implementation of common methods used in logistic regression problems

    .. data:: data

        Dataset (train and test) as a Data class

    .. data:: cost

        Logistic cost

    .. data:: jac

        Jacobian of the cost function (logistic)
    """
    def __init__(self, data):
        """ Constructor

        :param data: Input data
        :type data: Data class
        """
        self.data = data
        self.cost = None
        self.jac = None
        self.theta = None
        self.init_params()
    def sigmoid(self, x):
        """ Compute the sigmoid function:

        .. math::

            \\sigma(x) = \\frac{1}{1 + e^{-x}}

        .. note::

            This version of sigmoid tries to avoid overflows in the
            computation of :math:`e^{-x}` by using an equivalent formulation
            when x is negative:

            .. math::

                \\sigma(x) = \\frac{e^{x}}{1 + e^{x}}

            This is algebraically identical to the definition of sigmoid,
            but :math:`e^{x}` cannot overflow when x is very negative.
            Since np.where evaluates both branches eagerly, we may still get
            overflow warnings for extreme x elements; therefore we ignore
            those warnings during this computation.

        :param x: Input
        :type x: Scalar or numpy ndarray
        :returns: Sigmoid function result
        :rtype: Scalar or numpy ndarray
        """
        with np.errstate(over='ignore', invalid='ignore'):
            return np.where(x >= 0,
                            1 / (1 + np.exp(-x)),
                            np.exp(x) / (1 + np.exp(x)))
    def hypothesis(self, theta, x):
        """ Make classification predictions for the data in x using theta.

        :param theta: Logistic regression parameters (n, 1)
        :param x: Feature matrix (m, n)
        :type theta: Numpy ndarray
        :type x: Numpy ndarray
        :returns: Logistic regression prediction (:math:`\\hat{y}`) with
            shape (m, 1), a number in the range (0.0, 1.0) for each sample.
            The number is the probability that the sample is classified as 1.
        :rtype: Numpy ndarray with shape (m, 1)
        """
        z = x.dot(theta)
        return self.sigmoid(z)
    def log_cost(self, theta, x, y, inplace=True):
        """ Logistic cost function (cross-entropy error function)

        Compute the cross-entropy error function for a logistic regression
        problem:

        .. math::

            J(\\theta) = -\\frac{1}{m}\\sum_i^m\\left(y_i \\log(h_{\\theta}(x_i))
                + (1 - y_i) \\log(1 - h_{\\theta}(x_i))\\right)

        where

        .. math::

            h_{\\theta}(x_i) = \\sigma(\\theta^T x_i)

        .. note::

            Calls to hypothesis may produce probabilities that are exactly
            0.0 or 1.0, due to the exponent in sigmoid and the large numbers
            involved. np.log(0.0) evaluates to -inf, so we clip the input of
            np.log to values very close to 0 and 1 instead.

        :param theta: Parameters vector
        :param x: Feature matrix (m, n)
        :param y: Labels vector (m, 1)
        :param inplace: If True assigns the result to the internal attribute cost
        :type theta: Numpy ndarray (shape (n, 1))
        :type x: Numpy ndarray (m, n)
        :type y: Numpy ndarray (m, 1)
        :type inplace: Boolean
        :returns: Cost
        :rtype: Scalar
        """
        m, n = x.shape
        theta = theta.reshape(n, 1)
        eps = np.finfo(np.float32).eps
        yhat_prob = np.clip(self.hypothesis(theta, x),
                            a_min=eps,
                            a_max=1.0 - eps)
        loss = np.mean(np.where(y == 1,
                                -np.log(yhat_prob),
                                -np.log(1 - yhat_prob)))
        if inplace:
            self.cost = loss.flat[0]
        return loss.flat[0]
    def log_cost_jac(self, theta, x, y, inplace=True):
        """ Logistic cost function Jacobian computation

        Computes the Jacobian of the logistic cost function:

        .. math::

            \\nabla J(\\theta) = \\frac{1}{m}\\sum_i x_i (h_{\\theta}(x_i) - y_i)

        :param theta: Parameters vector 1D
        :param x: Feature matrix (m, n)
        :param y: Labels vector (m, 1)
        :param inplace: If True assigns the result to the internal attribute jac
        :type theta: Numpy ndarray (shape (n, 1))
        :type x: Numpy ndarray (m, n)
        :type y: Numpy ndarray (m, 1)
        :type inplace: Boolean
        :returns: Logistic cost function Jacobian
        :rtype: Numpy ndarray (n,) where n is the number of parameters
            (theta) (scipy minimize needs this shape)
        """
        m, n = x.shape
        theta = theta.reshape(n, 1)
        eps = np.finfo(np.float32).eps
        yhat_prob = np.clip(self.hypothesis(theta, x),
                            a_min=eps,
                            a_max=1.0 - eps)
        yh = np.where(y.reshape(len(y), 1) == 1, yhat_prob - 1, yhat_prob)
        dtheta = np.dot(yh.T, x).T / float(m)
        if inplace:
            self.jac = dtheta
        return dtheta.flatten()
    def log_cost_opt(self, method='Newton-CG', jac=True):
        """ Optimization of the logistic cost function over the theta
        parameters (note: only the first 20 training samples are used here).
        """
        if jac:
            result = optimize.minimize(fun=self.log_cost,
                                       x0=self.theta.flatten(),
                                       method=method,
                                       args=(self.data.trainX[:20, :],
                                             self.data.trainY[:20], True),
                                       jac=self.log_cost_jac)
        else:
            result = optimize.minimize(fun=self.log_cost,
                                       x0=self.theta.flatten(),
                                       method=method,
                                       args=(self.data.trainX[:20, :],
                                             self.data.trainY[:20], True))
        return result
    def init_params(self):
        """ Init parameter vector
        """
        m, n = self.data.trainX.shape
        self.theta = np.random.rand(n, 1)
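To confirm the overflow handling in sigmoid works as intended, the piecewise formulation can be exercised on extreme inputs in isolation (standalone function, same logic as the method above):

```python
import numpy as np

def sigmoid(x):
    # Piecewise form: exp is never evaluated on a large positive argument
    # in the branch that np.where actually selects
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

# np.where still evaluates both branches, so silence the spurious warnings
with np.errstate(over='ignore', invalid='ignore'):
    out = sigmoid(np.array([-1000.0, 0.0, 1000.0]))
# out is [0.0, 0.5, 1.0] with no inf/nan
```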
The cost function is
$$J(\theta) = -\frac{1}{m}\sum_i^m\left(y_i \log(h_{\theta}(\vec{x_i})) + (1 - y_i) \log(1 - h_{\theta}(\vec{x_i}))\right)$$
with
$$h_{\theta}(\vec{x_i}) = \frac{e^{\vec{\theta}^T \vec{x_i}}}{1 + e^{\vec{\theta}^T \vec{x_i}}}$$
and the Jacobian and Hessian are:
$$\nabla J(\vec{\theta}) = \frac{1}{m}\sum_i \vec{x_i}\left(h_{\theta}(\vec{x_i}) - y_i\right) = \frac{1}{m}X^T(\vec{h} - \vec{y})$$
$$Hess(J(\vec{\theta})) = \frac{1}{m}X^T S X$$
where $S = diag(h_i(1 - h_i))$.
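For what it's worth, the Jacobian formula above can be compared against finite differences with scipy.optimize.check_grad; this is a minimal standalone sketch with synthetic data (flat 1-D theta and plain functions instead of the class, for simplicity):

```python
import numpy as np
from scipy.optimize import check_grad

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))

def cost(theta, X, y):
    # Cross-entropy loss, clipped to avoid log(0)
    p = np.clip(sigmoid(X.dot(theta)), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def jac(theta, X, y):
    # Analytic gradient: (1/m) X^T (h - y)
    return X.T.dot(sigmoid(X.dot(theta)) - y) / len(y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = (rng.random(50) > 0.5).astype(float)
err = check_grad(cost, jac, rng.standard_normal(5), X, y)
# err should be tiny (~1e-6) if the Jacobian matches the cost
```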
So, I run the optimization, but the function throws the warning I've quoted above and the result is poor. It is supposed to converge because the Hessian is positive definite, but if I compute the Hessian at the initial (random) parameters, it is not positive definite. On the other hand, if I try to optimize the parameters using the Powell method, it also gives a poor result, but it doesn't throw any error/warning.
Could you please help me with this issue?
Thanks in advance,