Possibility of optimizing an embedding function in embedding space.


I have an optimization problem of the form: $$\text{minimize}_\theta:\mathbf{E}\left[(f_\theta(y)-f_\theta(g(x)))^2\right]$$ where $g$ is a known linear function and $f_\theta$ is a nonlinear function parametrized by $\theta$. My question is whether optimizing this objective via stochastic gradient descent is likely to degenerate. For example, if $g(x)=3x$, then $f_\theta \equiv 0$ is a degenerate solution: it drives the objective to zero regardless of the data. I imagine some simple regularization should prevent this sort of degenerate behavior, but perhaps such edge cases are more numerous than they seem?
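To make the collapse concrete, here is a minimal numpy sketch of the $g(x)=3x$ example. Everything specific here is an illustrative assumption: I use a deliberately simple linear one-parameter family $f_\theta(t)=\theta t$ just to expose the degenerate minimizer, and a variance penalty (one common anti-collapse fix) as the assumed regularizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy instance: g(x) = 3x, y = g(x) + noise, and a deliberately
# simple one-parameter family f_theta(t) = theta * t, chosen only to
# expose the degenerate minimizer (the real f_theta would be nonlinear).
x = rng.normal(size=1000)
y = 3 * x + 0.1 * rng.normal(size=1000)

def objective(theta):
    # E[(f_theta(y) - f_theta(g(x)))^2], estimated from samples
    return np.mean((theta * y - theta * (3 * x)) ** 2)

thetas = np.linspace(-2, 2, 401)
losses = [objective(t) for t in thetas]
# The unregularized optimum is the collapsed solution theta = 0.
assert abs(thetas[int(np.argmin(losses))]) < 1e-6

def regularized(theta, lam=1.0):
    # Assumed variance penalty: push Var[f_theta(y)] toward 1 so that
    # f_theta cannot collapse to a constant map.
    return objective(theta) + lam * (np.var(theta * y) - 1.0) ** 2

reg_losses = [regularized(t) for t in thetas]
best = thetas[int(np.argmin(reg_losses))]
assert abs(best) > 0.1  # the regularized optimum is no longer theta = 0
```

A variance-style penalty of this kind is in the same spirit as the regularizers used to prevent representation collapse in self-supervised embedding methods.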

I know I could avoid this by rearranging the objective as: $$\text{minimize}_\theta:\mathbf{E}\left[(y-f^{-1}_\theta(f_\theta(g(x))))^2\right]$$ but I'm assuming $f_\theta$ isn't invertible. I'm sorry if any of this is trivial; I've simply never encountered problems of this form before and don't know where else to look for answers.

Best Answer

I am posting a "placeholder" answer, as requested, which summarizes the comments and discussion above.

I believe this is a situation where we have a random vector $(X,Y)$, where $X$ and $Y$ are possibly dependent. The joint distribution may or may not be known. In the system under study, we can observe $X$ but not $Y$, and we want to build a good estimator for $Y$. According to standard theory, the best mean-square-error estimator is $E[Y|X]$, but we may want to do something else for various reasons.
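As a quick sanity check on the claim that $E[Y|X]$ is the best mean-square-error estimator, here is a small numpy sketch on an assumed toy model where the conditional mean is known in closed form: $Y = X^2 + \text{noise}$, so $E[Y|X] = X^2$. We compare it against the best linear estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model: Y = X^2 + noise, so E[Y | X] = X^2 exactly.
x = rng.normal(size=100_000)
y = x ** 2 + 0.5 * rng.normal(size=100_000)

# MSE of the conditional-mean estimator: attains the noise floor (~0.25).
mse_cond_mean = np.mean((y - x ** 2) ** 2)

# MSE of the best linear estimator (least-squares fit of y on x).
coeffs = np.polyfit(x, y, 1)
mse_linear = np.mean((y - np.polyval(coeffs, x)) ** 2)

# E[Y|X] cannot be beaten in MSE; here the linear fit is far worse,
# since Cov(X, X^2) = 0 leaves it with essentially no signal.
assert mse_cond_mean < mse_linear
```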

Overall, we consider a class of functions $r(\theta) = E[h(\theta, W)]$ for some random variable $W$ that depends on either $X, Y$, or both, where $\theta$ is a vector and the function $h$ specifies some objective of interest. Likely we have some training data from which stochastic gradients of $r(\theta)$ can be computed.

The question asks about "degenerate" cases where $r(\theta)=0$ for a particular choice of $\theta$. I suspect this issue is essentially one of problem formulation. That is, you want to choose the domain of $\theta$ carefully, along with the optimization metrics, so that the resulting $r(\cdot)$ function represents something meaningful to optimize. Meaningful formulations will likely avoid these "degenerate" cases.

However, once we agree on a specific $r$ function (typically a nonconvex function), the optimization procedure likely does not "care" about the existence of cases that we would call "degenerate." If you have some way of computing a stochastic gradient of $r(\theta)$, and if you assume $r(\theta)$ is smooth with Lipschitz continuous gradients, you can use a standard stochastic gradient method with a sufficiently small stepsize, chosen with respect to the Lipschitz parameter and the variance of the stochastic gradient. Since the problem is nonconvex, you can only claim you are getting close to a point with near-zero gradient, not necessarily a global optimum.
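The procedure above can be sketched in a few lines of Python. The specific $r$, noise model, and stepsize are all illustrative assumptions: $r(\theta)=E[(\theta^2-W)^2]$ with $W\sim\mathcal{N}(1,0.1^2)$ is smooth and nonconvex, with stationary points at $\theta=0$ (a local maximum) and $\theta=\pm 1$, and SGD converges to whichever stationary point its initialization favors.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed objective: r(theta) = E[(theta^2 - W)^2], W ~ N(1, 0.1^2).
# Nonconvex, with stationary points at theta = 0 and theta = +/- 1.
def stochastic_grad(theta):
    w = rng.normal(1.0, 0.1)             # one sample of W
    return 4 * theta * (theta ** 2 - w)  # unbiased estimate of grad r

theta = 0.5       # initial point, in the basin of theta = 1
stepsize = 0.01   # small relative to the Lipschitz constant of grad r
for _ in range(5_000):
    theta -= stepsize * stochastic_grad(theta)

# We land near a stationary point (theta ~ 1 here), with no guarantee
# that it is the global minimum of the nonconvex r.
assert abs(theta - 1.0) < 0.15
```

With initialization at $\theta=-0.5$ the same loop would drift to the other stationary point $\theta=-1$, which illustrates the "near-zero gradient, not global optimum" caveat.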