Dear math enthusiasts,
I recently came across the adjoint (state) method in the context of sensitivity analysis of model perturbations for systems described by PDEs. I am a novice in the area, so I was trying to understand the basic idea with simple examples. I was happy to find lecture notes from a Stanford CS class that did help me understand things, but I am still confused about the uniqueness argument (with respect to the example in $\mathbb R^N$) they show, and this is where I need help.
To set the scene, let me briefly repeat the main statements of the problem. We want to minimize some function $f(x)$ subject to a constraint $g(x,p) = 0$, where $x, p$ live in some Hilbert spaces, $f$ maps to the real field and $g$ maps to another Hilbert space. For simplicity, I will use Euclidean spaces, i.e., $x \in \mathbb R^N$, $p \in \mathbb R^P$ and $g: \mathbb R^N \times \mathbb R^P \to \mathbb R^Q$.
We are interested in computing the sensitivity of our cost $f(x)$ with respect to $p$, i.e., $\frac{\partial f}{\partial p}$. As $x$ depends on $p$, this could be done via the chain rule, i.e., using $\frac{\partial f}{\partial p} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial p}$. Now, while $\frac{\partial f}{\partial x}$ is easy to compute, $\frac{\partial x}{\partial p}$ is hard, and we want to avoid it.
The way to go is to define a Lagrangian $\mathcal L(x,p,\lambda) = f(x) + \lambda^T g(x,p)$. Since $g(x,p)=0$ for every feasible solution, we have $\mathcal L(x,p,\lambda) = f(x)$ there, no matter how $\lambda$ is chosen. Therefore our desired $\frac{\partial f}{\partial p}$ is equal to $\frac{\partial \mathcal L}{\partial p}$, which we can then expand as $$ \frac{\partial \mathcal L}{\partial p} = \frac{\partial f(x)}{\partial x}\frac{\partial x}{\partial p} + \left(\frac{\partial\lambda}{\partial p}\right)^T g(x,p) + \lambda^T\left( \frac{\partial g(x,p)}{\partial x}\frac{\partial x}{\partial p} + \frac{\partial g(x,p)}{\partial p}\right). $$ The second term is zero (since $g(x,p)$ is zero). Collecting the remaining terms, we can write this as $$ \frac{\partial \mathcal L}{\partial p} = \left(\frac{\partial f(x)}{\partial x} + \lambda^T \frac{\partial g(x,p)}{\partial x} \right)\frac{\partial x}{\partial p} + \lambda^T \frac{\partial g(x,p)}{\partial p}. $$ And now comes the magic: since $\lambda$ is arbitrary, we can avoid having to calculate $\frac{\partial x}{\partial p}$ by choosing $\lambda$ such that $\frac{\partial f(x)}{\partial x} + \lambda^T \frac{\partial g(x,p)}{\partial x}=0$. Once such a $\lambda$ is found, we have $\frac{\partial f(x)}{\partial p} = \lambda^T \frac{\partial g(x,p)}{\partial p}$.
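To make the recipe concrete, here is a minimal numerical sketch with a toy instance of my own (not from the lecture notes): $N=Q=2$, a single scalar parameter $p$, $g(x,p)=A(p)x-b$ and $f(x)=\frac12\|x\|^2$. The adjoint-based sensitivity is checked against a central finite difference:

```python
import numpy as np

# Toy instance (my own choice, not from the notes): N = Q = 2,
# scalar parameter p, g(x, p) = A(p) x - b, f(x) = 0.5 ||x||^2.
b = np.array([1.0, 2.0])

def A(p):
    return np.array([[2.0 + p, 1.0],
                     [0.0,     1.0 + p]])

def dA_dp(p):
    # elementwise derivative of A(p) with respect to the scalar p
    return np.array([[1.0, 0.0],
                     [0.0, 1.0]])

def solve_x(p):
    # feasible point: the x satisfying g(x, p) = 0
    return np.linalg.solve(A(p), b)

def f(x):
    return 0.5 * x @ x

def df_dp_adjoint(p):
    x = solve_x(p)
    # df/dx = x^T here, so the adjoint equation
    # df/dx + lambda^T dg/dx = 0 becomes A(p)^T lambda = -x.
    lam = np.linalg.solve(A(p).T, -x)
    # dg/dp = (dA/dp) x, and df/dp = lambda^T dg/dp.
    return lam @ (dA_dp(p) @ x)

def df_dp_finite_diff(p, h=1e-6):
    return (f(solve_x(p + h)) - f(solve_x(p - h))) / (2.0 * h)

print(df_dp_adjoint(0.3), df_dp_finite_diff(0.3))  # the two should agree closely
```

Note that the adjoint solve only involves $A(p)^T$, never $\frac{\partial x}{\partial p}$, which is the whole point of the method.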
My actual question (sorry for the long foreword) relates to the existence and uniqueness of such a $\lambda$. Going back to the Euclidean example: as $g(x,p)$ maps $x \in \mathbb R^N$ to $\mathbb R^Q$ (for given $p$), $\lambda$ lives in $\mathbb R^Q$ as well, and $\frac{\partial g(x,p)}{\partial x}$ is $Q \times N$. Therefore $\frac{\partial f(x)}{\partial x} = -\lambda^T \frac{\partial g(x,p)}{\partial x}$ is a system of $N$ equations in $Q$ unknowns. We need it to have an exact solution, otherwise the term involving $\frac{\partial x}{\partial p}$ will not vanish. For this reason, the source I cited above actually assumes $Q=N$. But if I have $N$ constraints on $x \in \mathbb R^N$, this leaves no degrees of freedom to optimize, so this doesn't make sense to me.
To be even more concrete, an example also used in the lecture notes is $g(x,p) = A(p) x - b$, though let us allow a rectangular $A(p) \in \mathbb R^{Q \times N}$ for now. If we try the adjoint method, the condition for the $\frac{\partial x}{\partial p}$ term to vanish is $A(p)^T \lambda + \left(\frac{\partial f}{\partial x}\right)^T=0$. For this to be solvable for an arbitrary right-hand side, $A(p)^T$ must be surjective, i.e., $A(p)$ must have rank $N$, which requires $Q\geq N$. But then $A(p)x=b$ has at most one solution, so there is nothing to optimize.
My suspicion here is that what I didn't consider so far is that $\frac{\partial f}{\partial x}$ is not actually arbitrary. Since we evaluate it at the optimal point, where $g(x,p) = 0$, it lies in the row space of $A(p)$, and this is how things come together (e.g., for $f(x) = \frac 12 \|x\|^2$ we would have $x_{opt} = A(p)^+ b$). But that's just an argument in this very specific example of linear constraints.
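This suspicion can at least be checked numerically. A small sketch with random data (my own construction): for $Q<N$ and $f(x)=\frac12\|x\|^2$, the minimum-norm solution $x_{opt}=A^+ b$ has gradient $\nabla f(x_{opt})=x_{opt}$ lying in the row space of $A$, so the overdetermined adjoint system is still exactly solvable:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, N = 2, 4                      # underdetermined: fewer constraints than unknowns
A = rng.standard_normal((Q, N))  # generic random matrix, assumed full row rank
b = rng.standard_normal(Q)

# Minimizer of f(x) = 0.5 ||x||^2 subject to A x = b: the minimum-norm solution.
x_opt = np.linalg.pinv(A) @ b

# grad f = x_opt; the adjoint equation A^T lam = -x_opt has N equations
# in Q < N unknowns, yet it is exactly solvable because x_opt = A^+ b
# lies in the row space of A.
lam, *_ = np.linalg.lstsq(A.T, -x_opt, rcond=None)
residual = np.linalg.norm(A.T @ lam + x_opt)
print(residual)  # ~ 0 up to floating point
```

So in this linear example the adjoint equation is consistent at the optimum even though $Q < N$, exactly because $\nabla f$ is not arbitrary there.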
So, framing it concisely: Can we show in general when $\frac{\partial f(x)}{\partial x} + \lambda^T \frac{\partial g(x,p)}{\partial x} = 0$ has a solution, and whether this solution is unique? Ideally not just in Euclidean spaces but in a more general Hilbert-space setting? What I am really trying to understand is how to do all this when what we want to optimize over are functions (I am trying to wrap my head around full waveform inversion, actually).
Since $x$ is the solution of an optimization problem in this situation, your question basically asks for the existence and uniqueness of a Lagrange multiplier $\lambda$ (this is because the adjoint equation $\frac{\partial f(x)}{\partial x}+\lambda^T \frac{\partial g(x,p)}{\partial x}=0$ is equivalent to an equation in the KKT system).
This question is addressed here on Wikipedia, where you can simply ignore all the inequality constraints in your case.
Rewriting your adjoint equation with gradients, it is $$ \nabla f(x) + \sum_i \lambda_i \nabla_x g_i(x,p) = \mathbf 0, $$ which is exactly a line from the KKT conditions. Note that on Wikipedia they use $$ \nabla f(x^*) + \sum_{i=1}^m \mu_i \nabla g_i(x^*) + \sum_{j=1}^\ell \lambda_j \nabla h_j(x^*) = \mathbf 0. $$ If you consider that the $g_i$ there can be ignored because you have no inequalities, and that the $h_j(x)$ on Wikipedia are the equality constraints $g_i(x,p)$ here, then you can see that the two equations are the same.
There are also several constraint qualifications listed there under which $\lambda$ exists, for example LCQ and LICQ. LCQ is satisfied if $g$ is affine in $x$, which is the case for $g(x,p)=A(p)x-b$, so in this case the existence of $\lambda$ follows from KKT theory. Under LCQ alone, however, $\lambda$ is in general not unique.
LICQ in your case means that the rows of $\frac{\partial g}{\partial x}$ are linearly independent, i.e., that $\frac{\partial g}{\partial x}$ is surjective, or equivalently that its transpose is injective. In this case the uniqueness of $\lambda$ follows from that injectivity.
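A tiny numerical illustration of the uniqueness part (toy numbers of my own): when the rows of $\frac{\partial g}{\partial x}$ are linearly dependent, so LICQ fails, the multiplier is no longer unique, because any $\nu$ in the null space of the transpose can be added:

```python
import numpy as np

# dg/dx with linearly dependent rows (LICQ violated):
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])     # second row = 2 * first row
grad_f = np.array([3.0, 6.0])  # lies in the row space, so solutions exist

lam1 = np.array([-3.0,  0.0])  # satisfies A^T lam1 = -grad_f
lam2 = np.array([-1.0, -1.0])  # a different multiplier, same equation
print(A.T @ lam1 + grad_f, A.T @ lam2 + grad_f)  # both are the zero vector
```

With independent rows, $A^T$ is injective and the difference $\lambda_1 - \lambda_2$ would be forced to zero, which is exactly the uniqueness statement above.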
For Hilbert or Banach spaces, I believe the analogous condition to LICQ is that $\frac{\partial g}{\partial x}$ is surjective; then $\lambda$ exists and is unique.