I am trying to solve the optimization problem below using proximal gradient descent on a dataset in Python:
$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^d}\frac{1}{m}\sum_{i=1}^m\Big[\log(1+\exp(x_i\theta))-y_ix_i\theta\Big]+\frac{\lambda_2}{2}\|\theta\|^2_2 + \lambda_1\|\theta\|_1$
where $x_i \in \mathbb{R}^{1 \times d}$ is sample $i$, $\theta \in \mathbb{R}^d$, $y_i \in \{0,1\}$ is the label for sample $i$, and $\lambda_1, \lambda_2 > 0$.
The gradient of the smooth part $g(\theta)$ is
$\nabla g(\theta) = \frac{1}{m}\sum_{i=1}^m\Big[\frac{x_ie^{x_i\theta}}{1+e^{x_i\theta}} - x_iy_i\Big]+\lambda_2\theta$
The dataset contains 784 feature columns and 2000 data points; I use half of them for learning $\theta$ and the remaining half for evaluating the accuracy of the classifier. The learnt $\theta$ is used to predict labels via $\frac{1}{1+\exp(-x\theta)}$. I need to build the classifier using proximal operators and plot the accuracy per iteration.
This is the code I have come up with, but my accuracy is way off; it is expected to converge to >98% accuracy in one to five iterations at most.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import minimize
from sklearn.metrics import accuracy_score

np.random.seed(100)
data = pd.read_csv("data.csv", header=None).values
A = data[:, 0:784]
y = data[:, -1]
A_train = A[:1000, ]
y_train = y[:1000, ]
A_test = A[1000:, ]
y_test = y[1000:, ]
lam_1 = 10
lam_2 = 5
m = len(A_train)

def prox_op(theta, t):
    a = np.maximum(theta - t, 0)
    b = np.minimum(theta + t, 0)
    return a + b

def func(theta):
    func_sum = 0
    for i in range(m):
        func_sum += np.log(1 + np.exp(A_train[i] * theta)) - y_train[i] * A_train[i] * theta
    term2 = lam_2 / 2 * np.linalg.norm(theta)
    term3 = lam_1 * np.linalg.norm([theta], ord=1)
    return func_sum / m + term2 + term3

def grad(theta):
    for i in range(m):
        grad_sum = (A_train[i] * np.exp(A_train[i] * theta) / (1 + np.exp(A_train[i] * theta))) - (A_train[i] * y_train[i])
    return grad_sum + (theta * lam_2)

t = 0.01
iterations = 100
theta_new = np.random.uniform(-1, 1, 784)
theta_list = [theta_new]
func_list = [func(theta_new)]
acc_sc = []
for i in range(iterations - 1):
    theta_current = theta_list[-1]
    grad_g = grad(theta_current)
    theta_new_gd = theta_current - t * grad_g
    theta_new = prox_op(theta_new_gd, lam_1 * t)
    theta_list.append(theta_new)
    func_list.append(func(theta_new))
    yy_tr = 1 / (1 + np.exp(np.dot(-A_train, theta_new)))
    acc_sc.append(accuracy_score(yy_tr > t * lam_1, y_train))

plt.plot(func_list)
plt.title('Proximal GD')
plt.xlabel('Iteration')
plt.ylabel('Function Value')
plt.show()
Could someone point out where I'm going wrong, please?
This is the plot of function value vs. number of iterations: [plot]
Before doing the proximal GD, you can (should?) check that your analytical gradient is correct. I suspect it should be *** grad_sum += *** instead of *** grad_sum = *** since you are summing over the examples. Also, the normalisation term (the $1/m$ factor) has disappeared...
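To make the check concrete, here is a minimal sketch of a finite-difference gradient check on a small random problem. The variable names (X, y, g, grad_g) and the corrected gradient are my own; the key points are the missing $1/N$ factor and the accumulation over examples mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 5
X = rng.normal(size=(N, d))
y = rng.integers(0, 2, size=N).astype(float)
lam_2 = 0.1

def g(w):
    # smooth part of the objective; x_i theta computed as a dot product
    z = X @ w
    return np.mean(np.log1p(np.exp(z)) - y * z) + 0.5 * lam_2 * (w @ w)

def grad_g(w):
    # sigmoid(z) = e^z / (1 + e^z); note the 1/N factor and the sum over examples
    s = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (s - y) / N + lam_2 * w

# central finite differences along each coordinate
w0 = rng.normal(size=d)
eps = 1e-6
num = np.array([(g(w0 + eps * e) - g(w0 - eps * e)) / (2 * eps)
                for e in np.eye(d)])
print(np.max(np.abs(num - grad_g(w0))))
```

If the analytical gradient is right, the printed discrepancy should be tiny (numerical noise); a large value points to a bug like the ones above.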
Also, you give the gradient but not the proximal update, so it is not easy to pinpoint where your error is located.
Using slightly different notations, denote $$ \phi(\mathbf{w}) = g(\mathbf{w}) + \lambda_1 \|\mathbf{w}\|_1 $$ where $g(\mathbf{w}) =\frac{1}{N} \sum_{n=1}^N \left[ \log \left(1+e_n\right) - y_n \mathbf{x}_n^T \mathbf{w} \right] + \frac{\lambda_2}{2} \|\mathbf{w}\|^2_2 $ and the scalar $e_n = e^{\mathbf{x}_n^T \mathbf{w}}$.
The update requires the soft-thresholding operator $$ \mathbf{w} \leftarrow S_{\lambda_1 t} \left[ \mathbf{w} -t \nabla g(\mathbf{w}) \right] $$ where $\nabla g(\mathbf{w}) =\frac{1}{N} \sum_{n=1}^N \left[ \left( \frac{e_n}{1+e_n}-y_n \right) \mathbf{x}_n \right] + \lambda_2 \mathbf{w} $.
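Putting the pieces together, here is a minimal sketch of that proximal gradient loop on synthetic data. The names (S for the soft-thresholding operator, step size t) and the $\lambda_1$, $\lambda_2$, and step-size values are my own assumptions for illustration, not tuned for the asker's dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 10
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = (1.0 / (1.0 + np.exp(-X @ w_true)) > 0.5).astype(float)

lam_1, lam_2, t = 0.01, 0.01, 0.1

def S(v, tau):
    # soft-thresholding: prox of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def grad_g(w):
    # sigmoid(x_n^T w) = e_n / (1 + e_n), with the 1/N normalisation
    s = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (s - y) / N + lam_2 * w

w = np.zeros(d)
for _ in range(100):
    # w <- S_{lam_1 t}[ w - t * grad g(w) ]
    w = S(w - t * grad_g(w), lam_1 * t)

acc = np.mean((1.0 / (1.0 + np.exp(-(X @ w))) > 0.5) == y)
print(acc)
```

A side remark: the predicted probability is compared against 0.5 here, rather than against $t\lambda_1$ as in the posted code.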