I have two related questions regarding the correct way to perform numerical gradient descent for an optimal control problem with a non-uniform mesh/grid.
1)
Say I am solving a PDE-constrained optimisation problem, and I have constructed a Lagrangian $\mathcal{L}$ which incorporates the cost functional as well as constraints for the PDE with an associated adjoint variable. Something like $$\mathcal{L} = \mathcal{C} - \langle \lambda, - \mathcal{A} u - f \rangle$$
I construct the adjoint system by taking variations of $\mathcal{L}$ in the state variable, and I can take variations in the control $f$ to construct a variational inequality to be satisfied by the optimal control.
When implementing a numerical scheme, I would usually perform a line search with step direction the negative of the sensitivity appearing in the variational inequality: $$\frac{\delta \mathcal{L}}{\delta f}.$$ Now consider, in a finite volume setting, the numerical gradient of $\mathcal{L}$ with respect to $f_i$ (the value of $f$ at the $i^{\textrm{th}}$ node): $$V_i \frac{\delta \mathcal{L}}{\delta f}.$$ On a uniform mesh, where all volumes $V_i$ are equal, this is just a uniform rescaling of the sensitivity, which only changes the line-search step size. On a nonuniform grid, however, it yields a genuinely different descent direction.
I'm no longer sure which of these is correct. In the first case the sensitivity is smoothly varying, so the updated control $f$ also remains smooth. In the second case we are computing the gradient of the discrete problem, which makes sense since that is the problem we are actually solving. I must be missing something here!
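To make the discrepancy concrete, here is a small numerical sketch (the quadratic cost and target below are illustrative, not part of the problem above) comparing the two candidate directions on a nonuniform finite-volume mesh:

```python
import numpy as np

# Illustrative nonuniform finite-volume mesh of [0, 1].
rng = np.random.default_rng(0)
edges = np.sort(np.concatenate(([0.0, 1.0], rng.uniform(0.0, 1.0, 20))))
centres = 0.5 * (edges[:-1] + edges[1:])
V = np.diff(edges)                           # nonuniform cell volumes V_i

# Discretised cost  C_h(f) = 1/2 sum_i V_i (f_i - g_i)^2  ~  1/2 ||f - g||^2_{L^2}
g = np.sin(np.pi * centres)                  # stand-in target
f = np.zeros_like(centres)

sens = f - g                                 # continuous sensitivity dC/df at centres
grad = V * (f - g)                           # numerical gradient of C_h w.r.t. f_i

# On a nonuniform mesh the two are not parallel:
cosine = grad @ sens / (np.linalg.norm(grad) * np.linalg.norm(sens))
print(cosine < 1.0)                          # True

# Dividing by the volumes (the Riesz map for <u,v> = sum_i V_i u_i v_i)
# recovers the smooth sensitivity from the numerical gradient:
print(np.allclose(grad / V, sens))           # True
```

The division by $V_i$ in the last step is what relates the two candidates.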
2)
My second question is closely related. Consider an optimisation problem where I simultaneously perform gradient descent in an optimal control (a function) and in a single scalar model parameter, and I would like to construct a step direction combining the gradients of both. The best way to explain is with an example.
Consider an optimisation problem for an unsteady heat source and a diffusion coefficient over the space interval $[0,L]$ and time interval $[0,T]$. The Lagrangian is $$\mathcal{L} = \mathcal{C} - \int_{0}^T \int_0^L \lambda (u_t + c \Delta u - f(x,t)) $$ where $c$ is a constant and $f$ is a function of space and time.
Considering $c$ as a function of space and time, we have the functional derivative $$\frac{\delta \mathcal{L}}{\delta c} = - \lambda \Delta u. $$ Obviously we can't allow $c$ to vary in space and time, so I would have to project the sensitivity onto the space of constant functions, $$\frac{1}{T}\frac{1}{L} \int_0^T \int_0^L - \lambda \Delta u, $$ or update with the unprojected sensitivity and project afterwards (same outcome).
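As a sanity check, this projection step can be sketched numerically (the mesh, weights and data below are illustrative stand-ins):

```python
import numpy as np

# L^2-projection of a cellwise sensitivity s_ij ~ (-lambda * Laplacian(u))(x_i, t_j)
# onto the constants, using space-time cell volumes W_ij as quadrature weights.
rng = np.random.default_rng(1)
nx, nt = 8, 5
dx = rng.uniform(0.5, 1.5, nx); dx *= 1.0 / dx.sum()   # nonuniform dx, L = 1
dt = rng.uniform(0.5, 1.5, nt); dt *= 2.0 / dt.sum()   # nonuniform dt, T = 2
W = np.outer(dx, dt)                                   # W_ij = dx_i * dt_j

s = rng.standard_normal((nx, nt))                      # stand-in for -lambda * Laplacian(u)

# Projection onto constants: (1/(T*L)) * double integral of s
s_bar = (W * s).sum() / W.sum()

# The residual s - s_bar is L^2-orthogonal to the constants:
print(np.isclose((W * (s - s_bar)).sum(), 0.0))        # True
```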
A straight derivative of the Lagrangian would give the same answer but without the rescaling $$\frac{\mathrm{d} \mathcal{L}}{\mathrm{d} c} = \int_0^T \int_0^L - \lambda \Delta u $$
If I were optimising for the diffusion coefficient alone, this missing rescaling wouldn't be an issue. But since I want to construct a joint gradient in $(c,f)$, where $$\frac{\delta \mathcal{L}}{\delta f} = \lambda,$$ I need to work out the correct way to weight these two individual gradients.
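One way to see the relationship between the two candidate scalings for the $c$-component: if $c$ is treated as an element of the one-dimensional subspace of constant functions with the $L^2$ inner product, the Riesz representation divides the straight derivative by $\langle 1, 1 \rangle_{L^2} = T\,L$, which is exactly the rescaling above. A sketch with illustrative stand-in data:

```python
import numpy as np

# The two candidate scalings for the c-component of a joint gradient
# differ exactly by the factor T*L (all data below is illustrative).
rng = np.random.default_rng(2)
nx, nt, L, T = 8, 5, 1.0, 2.0
W = np.full((nx, nt), (L / nx) * (T / nt))   # uniform space-time quadrature weights
TL = W.sum()                                 # = T * L

lam = rng.standard_normal((nx, nt))          # stand-in for the adjoint lambda
lap_u = rng.standard_normal((nx, nt))        # stand-in for Laplacian(u)

grad_f = lam                                 # dL/df = lambda  (L^2 Riesz gradient for f)
dL_dc = (W * (-lam * lap_u)).sum()           # straight derivative dL/dc (Euclidean gradient)
grad_c = dL_dc / TL                          # L^2 Riesz gradient on the constants

print(np.isclose(grad_c * TL, dL_dc))        # True by construction
```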
The answer to the first question also feeds into this. Should I avoid functional calculus altogether and just compute numerical gradients in $(c,f_1,\ldots,f_n)$ instead? If there is any good literature on optimal control problems involving coefficients which may or may not have space/time dependence, I would be interested to read it.
After some reading, I think I can answer part of my first question (or at least the answer is too long for a comment).
If we let $F$ denote the space of feasible controls, we define the reduced cost functional $\zeta: F \rightarrow \mathbb{R}$ by $$\zeta(f) = \mathcal{C}(u(f),f).$$ The necessary optimality condition (variational inequality) to be satisfied by the optimal control $\overline{f}$ is $$\zeta'(\overline{f}) (f - \overline{f}) = \langle \zeta'(\overline{f}) , f - \overline{f} \rangle_{F^*,F} \geq 0, \quad \forall f \in F$$ where $\zeta'(\overline{f}) \in F^*$ denotes the Fréchet derivative of $\zeta$ at $\overline{f} \in F$.
To construct a gradient from this Fréchet derivative, we need to choose an inner product space and apply the Riesz representation theorem, $$\langle \zeta'(\overline{f}) , f - \overline{f} \rangle_{F^*,F} = \langle \nabla\zeta(\overline{f}) , f - \overline{f} \rangle_{F},$$ which identifies the Fréchet derivative $\zeta'(\overline{f}) \in F^*$ with a gradient $\nabla\zeta(\overline{f})\in F$.
If we choose $L^2$ as the inner-product space, the gradient is the functional derivative $\delta \mathcal{L}/\delta f$; if instead a discrete $\ell^2$ inner product on the finite-volume coefficients is chosen, you obtain the functional derivative weighted by the volumes $V_i$.
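In the finite-volume case this Riesz map is just a diagonal mass-matrix solve. A sketch with an illustrative cost (uniform cells for simplicity), showing why the two gradients behave so differently under mesh refinement:

```python
import numpy as np

# The Riesz map as a mass-matrix solve: for a finite-volume space the L^2 mass
# matrix is diagonal, M = diag(V), so the derivative vector d_i and the L^2
# gradient g are related by d = M g.
def gradients(n):
    """l2 and L2 gradients of the illustrative cost z(f) = int f*sin(pi x) dx at f = 0."""
    x = (np.arange(n) + 0.5) / n            # cell centres, n uniform cells on [0, 1]
    V = np.full(n, 1.0 / n)                 # cell volumes = mass-matrix diagonal
    d = V * np.sin(np.pi * x)               # derivative vector dz/df_i
    return d, d / V                         # l2 gradient, L2 (Riesz) gradient

d_fine, g_fine = gradients(1000)
# Under refinement the l2 gradient entries vanish like V_i = 1/n, while the
# L2 gradient converges to the continuous sensitivity sin(pi x):
print(np.abs(d_fine).max())                 # roughly 1e-3
print(np.abs(g_fine).max())                 # roughly 1
```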
Both gradients (and others derived from Riesz representations with different inner products) are valid, but there can be convergence issues associated with choosing the "wrong" inner-product space: An iteration count estimate..., Schwedes et al. 2016. Even after reading this paper, it is still not clear to me how to make the choice so as to avoid mesh-dependent convergence of the optimisation algorithm. We should use "the inner-product induced by the control space", but we don't commit to a control space until we have to deal with the variational inequality(?). I feel it's probably best to use the $L^2$ Riesz representation when performing optimisation at the continuous level, and a discrete representation for discrete adjoints.
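The mesh-dependence can be reproduced in a toy experiment (quadratic cost, graded mesh; all choices here are illustrative, not from the paper): fixed-step steepest descent with the plain $\ell^2$ gradient slows down as cells shrink, while descent with the $L^2$ (Riesz) gradient converges in a mesh-independent number of iterations.

```python
import numpy as np

def iterations(n, use_riesz, tol=1e-6, max_iter=100000):
    """Fixed-step steepest descent for z(f) = 1/2 ||f - g||^2_{L^2} on a graded mesh."""
    edges = np.linspace(0.0, 1.0, n + 1) ** 2        # cells shrink towards x = 0
    centres = 0.5 * (edges[:-1] + edges[1:])
    V = np.diff(edges)                               # nonuniform cell volumes
    g = np.sin(np.pi * centres)                      # illustrative target
    f = np.zeros(n)
    alpha = 0.5 if use_riesz else 0.5 / V.max()      # stable step in each metric
    for k in range(1, max_iter + 1):
        d = V * (f - g)                              # l2 derivative vector
        f = f - alpha * (d / V if use_riesz else d)  # Riesz vs plain descent step
        if np.sqrt((V * (f - g) ** 2).sum()) < tol:  # discrete L^2 error norm
            return k
    return None

print(iterations(40, use_riesz=True))    # mesh-independent, around 20 iterations
print(iterations(40, use_riesz=False))   # hundreds of iterations on this mesh
```

The slow modes of the $\ell^2$ iteration live on the smallest cells, which is exactly the volume-weighting issue from the first question.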