Chain rule involving intermediate functional derivatives


I'm trying to understand how the functional derivative applies to optimization problems of the following form.

Let $\mathcal{F}$ be a space of functions $f : \mathcal{X} \to \mathcal{Y}$. Suppose $L: \mathcal{F} \to \mathbb{R}$ is a functional over the space of functions $\mathcal{F}$. A common optimization problem is thus \begin{equation} \min_{f \in \mathcal{F}} L(f). \end{equation}

This is often not tractable in practice, so instead we assume $f$ belongs to some parametrized set of functions $\mathcal{G}$, such that $\mathcal{G} \ni f: \mathcal{X} \times \Theta \to \mathcal{Y}$, where $\Theta$ is some space of parameters, say Euclidean space for simplicity, i.e., $\Theta = \mathbb{R}^n$. The optimization can then be performed over all $\theta \in \Theta$: \begin{equation} \min_{f \in \mathcal{G}} L(f) = \min_{\theta \in \Theta} L(f(\cdot, \theta)). \end{equation}
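For concreteness (this specific instance is my own and not essential to the question), one could take least-squares regression onto a linear-in-parameters family: \begin{equation} L(f) = \int_{\mathcal{X}} \big(f(x) - g(x)\big)^2 \, dx, \qquad f(x, \theta) = \sum_{i=1}^n \theta_i \, \varphi_i(x), \end{equation} for some fixed target $g : \mathcal{X} \to \mathcal{Y}$ and fixed basis functions $\varphi_i : \mathcal{X} \to \mathcal{Y}$.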

Now, first-order optimality conditions imply that an optimal vector $\theta^*$ satisfies \begin{align} \frac{\partial}{\partial \theta}L(f(\cdot, \theta^*)) &= \mathbf{0}. \end{align} We can think of $L$ here as being reduced to an ordinary function $L : \Theta \to \mathbb{R}$, so a typical gradient descent algorithm should be applicable. But, due to the intermediate function $f$, the chain rule seems to imply something like: \begin{equation} \frac{\partial}{\partial \theta}L(f(x, \theta^*)) = \frac{\delta L}{\delta f}(x) \frac{\partial}{\partial \theta}f(x, \theta^*) = \mathbf{0}. \end{equation}

Here $\frac{\delta L}{\delta f}$ is the functional derivative of $L$. This seems to imply: (1) the partial derivative of $L$ w.r.t. $\theta$ is a function of $x$, and (2) that this function is identically zero, which makes sense intuitively if $\theta^*$ is a minimizer. Is this application of the chain rule correct?
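As a sanity check, I tried a toy example of my own: $L(f) = \int_{\mathcal{X}} f(x)^2 \, dx$ with $f(x, \theta) = \sum_{i=1}^n \theta_i \, \varphi_i(x)$ for fixed basis functions $\varphi_i$. Since $\frac{\delta L}{\delta f}(x) = 2 f(x)$ and $\frac{\partial f}{\partial \theta_i}(x, \theta) = \varphi_i(x)$, direct differentiation of the reduced function $L(\theta) = \int_{\mathcal{X}} \big(\sum_j \theta_j \varphi_j(x)\big)^2 dx$ gives \begin{equation} \frac{\partial}{\partial \theta_i} L(f(\cdot, \theta)) = \int_{\mathcal{X}} 2 f(x, \theta) \, \varphi_i(x) \, dx = \int_{\mathcal{X}} \frac{\delta L}{\delta f}(x) \, \frac{\partial f}{\partial \theta_i}(x, \theta) \, dx. \end{equation} In this toy case the pointwise product from the chain rule appears inside an integral over $x$, which is part of what I am trying to reconcile with the expression above.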

Now, in order to apply a gradient descent algorithm, we cannot calculate this derivative at every point $x \in \mathcal{X}$, but we could attempt to approximate it at finitely many points sampled from $\mathcal{X}$ (using whatever additional assumptions we might need). Is this the right intuition?
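To make the "finitely many points" idea concrete, here is a minimal sketch of what I have in mind, assuming the toy least-squares problem $L(f) = \int_0^1 (f(x) - g(x))^2 \, dx$ with a linear model $f(x, \theta) = \theta_0 + \theta_1 x$ and target $g(x) = 1 + 2x$ (all names and the specific problem are my own, purely illustrative). The functional derivative $\frac{\delta L}{\delta f}(x) = 2(f(x) - g(x))$ and the sensitivity $\frac{\partial f}{\partial \theta}$ are evaluated only at sampled points, and their pairing over $\mathcal{X}$ is estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, theta):
    # linear-in-parameters model: f(x, theta) = theta[0] + theta[1] * x
    return theta[0] + theta[1] * x

def g(x):
    # target function; theta* = [1, 2] reproduces it exactly
    return 1.0 + 2.0 * x

def grad_estimate(theta, n_samples=1000):
    # finitely many sample points from X = [0, 1]
    x = rng.uniform(0.0, 1.0, size=n_samples)
    # functional derivative of L(f) = ∫ (f - g)^2 dx at the samples:
    # (δL/δf)(x) = 2 (f(x) - g(x))
    dL_df = 2.0 * (f(x, theta) - g(x))
    # sensitivity ∂f/∂θ at the samples, shape (n_samples, 2)
    df_dtheta = np.stack([np.ones_like(x), x], axis=1)
    # Monte Carlo estimate of ∫ (δL/δf)(x) (∂f/∂θ)(x) dx over [0, 1]
    return dL_df @ df_dtheta / n_samples

theta = np.zeros(2)
for _ in range(2000):
    theta -= 0.5 * grad_estimate(theta)

print(theta)  # close to [1, 2], the parameters of g
```

The point of the sketch is that the $x$-dependence of $\frac{\delta L}{\delta f}$ is never resolved everywhere; it is only queried at the sampled points, and the sum over samples plays the role of the pairing over $\mathcal{X}$.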