Stochastic Gradient Descent for Ratio of non-linear functions


I came across a paper where the loss function to optimize is a ratio of two non-linear functions, which looks like this:

$$\hat{R}_{\text{SNIPS}}(\pi_w) \;=\; \frac{\frac{1}{n}\sum_{i=1}^{n} \delta_i\, \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}}{\frac{1}{n}\sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}}$$

where $w$ is the parameter and $\delta_i$ is an indicator variable.

Citing that the function cannot be optimized with SGD in this form, they reformulate it as a constrained optimization problem as follows:

$$\hat{w} \;=\; \operatorname*{argmin}_{w}\; \frac{1}{S}\,\frac{1}{n}\sum_{i=1}^{n} \delta_i\, \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} \qquad \text{s.t.}\quad \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} = S$$

I can follow up to this point, but they then (re)formulate this as a series of unconstrained optimization problems as follows:

$$\hat{w}_j \;=\; \operatorname*{argmin}_{w}\; \frac{1}{S_j}\,\frac{1}{n}\sum_{i=1}^{n} \delta_i\, \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} \qquad \text{s.t.}\quad \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} = S_j$$

for each $S_j$ in a grid of candidate values, with the Lagrangian form

$$\hat{w}_j \;=\; \operatorname*{argmin}_{w}\; \frac{1}{n}\sum_{i=1}^{n} \delta_i\, \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} \;-\; \lambda_j\!\left(\frac{1}{n}\sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} \;-\; S_j\right)$$

where it's claimed that $S_j$ does not influence the optimal parameter values, and the final optimization objective is:

$$\hat{w}_\lambda \;=\; \operatorname*{argmin}_{w}\; \frac{1}{n}\sum_{i=1}^{n} \left(\delta_i - \lambda\right) \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}$$

I couldn't follow the logic here. Is this a standard reduction for fractional optimization problems? Will this formulation give sub-optimal results compared to the original one?

Paper Link: https://www.cs.cornell.edu/people/tj/publications/joachims_etal_18a.pdf

Answer:

What they give is not a reformulation, but an algorithm for computing the optimal solution of the original problem.

Assuming you knew the optimal value $S^*$, the first two problems would be equivalent: dividing the objective by the constant $S^*$ does not change the minimizer. But normally you don't know $S^*$, so you explore a range of candidate values for the sum $S$, and for each $S_j$ in that range a constrained optimization problem has to be solved.
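To see the equivalence concretely, here is a toy numerical sketch of my own construction (none of it is from the paper): a two-action logistic policy, uniform logging propensity $0.5$, and hypothetical noisy indicator losses. Minimizing the ratio directly and solving the constrained problem at $S = S^*$ pick out the same parameter on a grid:

```python
import numpy as np

# Toy setup (illustrative assumptions, not from the paper):
# two actions, uniform logging policy pi0 = 0.5, logistic target policy.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)                       # contexts
y = rng.integers(0, 2, size=n)               # logged actions
base = (y != (x > 0)).astype(float)
delta = np.where(rng.random(n) < 0.3, 1.0 - base, base)  # noisy 0/1 losses

def ratios(w):
    """Importance weights pi_w(y_i|x_i) / pi_0(y_i|x_i) for scalar parameter w."""
    p1 = 1.0 / (1.0 + np.exp(-w * x))        # pi_w(y=1 | x)
    return np.where(y == 1, p1, 1.0 - p1) / 0.5

ws = np.linspace(-3, 3, 601)                 # brute-force parameter grid

# 1) Minimize the ratio (SNIPS) objective directly.
snips = np.array([(ratios(w) * delta).mean() / ratios(w).mean() for w in ws])
w_star = ws[snips.argmin()]
S_star = ratios(w_star).mean()               # denominator value at the optimum

# 2) Constrained problem at S = S*: minimize the numerator subject to the
#    denominator equalling S*. On a grid, "subject to" means keeping only
#    parameters whose constraint gap is tiny.
denoms = np.array([ratios(w).mean() for w in ws])
nums = np.array([(ratios(w) * delta).mean() for w in ws])
w_con = ws[np.where(np.abs(denoms - S_star) < 1e-6, nums, np.inf).argmin()]

print(w_star, w_con)  # the two formulations agree
```

Grid search stands in for SGD here purely so the equivalence is easy to check by brute force.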

The authors write down the Lagrangian of the constrained problem and note that each value of $S_j$ corresponds to a specific value of the Lagrange multiplier. So instead of fixing $S_j$, they fix the multiplier $\lambda$ and minimize the Lagrangian with $\lambda$ held fixed, which is an unconstrained problem that SGD can handle. Each fixed value $\lambda_i$ in turn induces a specific constraint value $S_i$, so by sweeping over a range of $\lambda$ values they still effectively explore the range of $S$.
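Here is what fixing $\lambda$ and reading off the induced $S$ looks like in a toy sketch (a hypothetical logistic-policy setup of my own; the numbers are not from the paper):

```python
import numpy as np

# Hypothetical toy data: two actions, uniform logging propensity 0.5.
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = rng.integers(0, 2, size=n)
base = (y != (x > 0)).astype(float)
delta = np.where(rng.random(n) < 0.3, 1.0 - base, base)  # noisy 0/1 losses

def ratios(w):
    p1 = 1.0 / (1.0 + np.exp(-w * x))
    return np.where(y == 1, p1, 1.0 - p1) / 0.5

ws = np.linspace(-3, 3, 601)
R = np.stack([ratios(w) for w in ws])        # importance weights, one row per w

results = []
for lam in (0.2, 0.4, 0.6, 0.8):
    # Unconstrained objective with the multiplier fixed:
    #   (1/n) * sum_i (delta_i - lam) * r_i(w)
    # (the constant lam * S_j term is dropped: it does not affect argmin over w)
    obj = ((delta - lam) * R).mean(axis=1)
    i = obj.argmin()
    results.append((lam, ws[i], R[i].mean()))
    print(f"lambda={lam:.1f} -> w_hat={ws[i]:+.2f}, induced S={R[i].mean():.3f}")
```

Each fixed $\lambda$ yields one unconstrained minimization (here by grid search; in the paper's setting it would be SGD) together with the constraint value $S$ that the solution implicitly satisfies.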

Then, having several candidate solutions $\hat w_i$ (one for each $\lambda_i$), they select the index $i^*$ that gives the minimal $R_{\textbf{SNIPS}}(\pi_{\hat w_{i^*}})$ among all the $R_{\textbf{SNIPS}}(\pi_{\hat w_{i}})$.
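Putting the two steps together, a sketch of the candidate-generation-plus-selection loop (same hypothetical toy setup as above; grid search stands in for SGD):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = rng.integers(0, 2, size=n)               # logged actions, propensity 0.5
base = (y != (x > 0)).astype(float)
delta = np.where(rng.random(n) < 0.3, 1.0 - base, base)  # noisy 0/1 losses

def ratios(w):
    p1 = 1.0 / (1.0 + np.exp(-w * x))
    return np.where(y == 1, p1, 1.0 - p1) / 0.5

def snips(w):
    r = ratios(w)
    return (r * delta).mean() / r.mean()

ws = np.linspace(-3, 3, 601)
R = np.stack([ratios(w) for w in ws])

# One candidate w_hat_i per multiplier lambda_i (unconstrained minimization) ...
lams = np.linspace(0.0, 1.0, 201)
cands = [ws[((delta - lam) * R).mean(axis=1).argmin()] for lam in lams]

# ... then select the candidate whose SNIPS estimate is smallest.
w_best = min(cands, key=snips)

# Sanity check: compare against brute-force minimization of the ratio itself.
w_brute = min(ws, key=snips)
print(w_best, snips(w_best), w_brute, snips(w_brute))
```

The sanity check illustrates the point of the answer: sweeping $\lambda$ and selecting by SNIPS recovers (up to grid resolution) the minimizer of the original ratio objective.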