In several applications, such as machine learning, the following setup arises:
Let $V$ denote a function space of real-valued functions equipped with the inner product $ \langle f, g\rangle = \sum_{i=1}^{m} f(x_{i}) g(x_{i}),$ where we only observe the values $f(x_{i})$ for $i=1,2,\dots,m.$ Now, define a functional $\ell : V \to \mathbb{R}$ by $\ell(f) = \sum_{i=1}^{m} l(y_{i},f(x_{i})),$ where the $y_{i}$ are fixed real numbers and $l$ is usually assumed to be convex and differentiable in its second argument, e.g., the quadratic error $l(y_{i},f(x_{i})) = (y_{i} -f(x_{i}))^{2}.$
Typically, this functional is treated as a function of $m$ real variables based on the right-hand side. So, the functional gradient is treated as an element of $\mathbb{R}^{m},$ i.e., $$\nabla \ell (f) = \begin{pmatrix} \frac{\partial \ell(f)}{\partial f(x_{1})} \\ \frac{\partial \ell(f)}{\partial f(x_{2})} \\ \vdots \\ \frac{\partial \ell(f)}{\partial f(x_{m})} \end{pmatrix} \in \mathbb{R}^{m}.$$
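As a concrete instance, with the quadratic error above, the components of this vector work out to
$$\frac{\partial \ell(f)}{\partial f(x_{i})} = \frac{\partial}{\partial f(x_{i})} \sum_{j=1}^{m} (y_{j} - f(x_{j}))^{2} = -2\,(y_{i} - f(x_{i})).$$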
Now, what I want to compute is $\sum_{i=1}^{m} \frac{\partial \ell(f)}{\partial f(x_{i})} g(x_{i}).$ Can this be written as $ \langle \nabla \ell (f), g \rangle,$ or, do I need to consider the standard inner product, i.e., dot product, on $\mathbb{R}^{m},$ where $g$ would be a vector with components $g(x_{i})$ for $i=1,2,\dots,m?$
Part of my confusion with $ \langle \nabla \ell (f), g \rangle$ is that I don't know how to check the right-hand side of the definition: is $\nabla \ell (f)$ an element of $V?$
Of note, sometimes this is done with the Gateaux derivative, but I am less familiar with this type of derivative. Also, the main aim is to get to gradient descent, which explains the restriction to finite dimensions.
Actually, I think the Gateaux derivative is probably the best way to understand this. But few people teach multivariable calculus in a way that makes Gateaux derivatives an obvious generalization. :(
The key point to understand is that a derivative isn't really a number; it's a best linear approximant. In calculus, you usually see $$f(x)=f(x_0)+f'(x_0)(x-x_0)+\dots$$ where $f'(x_0)$ is a number. In multivariable calculus, this turns into $$\vec{f}(\vec{x})=\vec{f}(\vec{x_0})+(\mathcal{D}\vec{f})(\vec{x_0})(\vec{x} - \vec{x_0})+\dots$$ where $(\mathcal{D}\vec{f})(\vec{x_0})$ is a matrix. This is because, in each case, the derivative is trying to write $f$ as a constant plus a linear part. A linear function from $\mathbb{R}^1$ to $\mathbb{R}^1$ is just multiplication by a number; a linear function from $\mathbb{R}^m$ to $\mathbb{R}^n$ is multiplication by a matrix.
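To see "best linear approximant" in action, here is a small numerical sketch (with an arbitrarily chosen $\vec{f}:\mathbb{R}^2\to\mathbb{R}^2$) checking that the Jacobian matrix reproduces $\vec{f}$ near a point up to an error that is quadratic in the displacement:

```python
import numpy as np

# An arbitrary smooth map f : R^2 -> R^2, chosen for illustration
def f(x):
    return np.array([x[0] * x[1], np.sin(x[0]) + x[1] ** 2])

# Its Jacobian at x: the matrix representation of the derivative (D f)(x)
def Df(x):
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 2 * x[1]]])

x0 = np.array([0.5, -1.0])
dx = np.array([1e-3, 2e-3])

# First-order expansion: f(x0 + dx) ~ f(x0) + Df(x0) @ dx
error = np.linalg.norm(f(x0 + dx) - (f(x0) + Df(x0) @ dx))

# The leftover error is quadratic in ||dx||: halving dx should
# roughly quarter it, which is what "best linear" means here.
print(error)
```

The same check with `dx / 2` gives an error about four times smaller, confirming that no better linear map exists at $\vec{x_0}$.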
You've also seen in linear algebra that a matrix representation is always just a sort of convenience for arithmetic: the same transformation has different representations in different bases. So maybe we should instead write that expansion in terms of the underlying linear transformation: $$\vec{f}(\vec{x})=\vec{f}(\vec{x_0})+\mathcal{D}(\vec{f})(\vec{x_0})(\vec{x} - \vec{x_0})+\dots\tag{1}$$ where $\mathcal{D}(\vec{f})(\vec{x_0})$ is the corresponding linear transformation. Now (1) is nicely coordinate-independent.
Of course, nobody writes that, because those parentheses are insane. But if we take a step back, we can see that each of those parentheses is serving a purpose:

- $\mathcal{D}(\vec{f})$ applies the operator $\mathcal{D}$ to a function, producing the derivative of $\vec{f}$;
- $\mathcal{D}(\vec{f})(\vec{x_0})$ evaluates that derivative at the point $\vec{x_0}$, producing a linear transformation;
- $\mathcal{D}(\vec{f})(\vec{x_0})(\vec{x} - \vec{x_0})$ applies that linear transformation to the displacement $\vec{x}-\vec{x_0}$, producing a vector in $\mathbb{R}^n$.
Putting it together, $$\mathcal{D}\in C^{\infty}\to(\mathbb{R}^m\to(\mathbb{R}^m\to\mathbb{R}^n))$$ In your case (the gradient), $n=1$, so we can just write: $$\nabla\in C^{\infty}\to(\mathbb{R}^m\to(\mathbb{R}^m\to\mathbb{R}))$$ There's a convenient notation for this: for a vector space $V$, let $V^*$ be the space of linear functions $V\to\mathbb{R}$. Then \begin{gather*} \nabla\in C^{\infty}\to(\mathbb{R}^m\to(\mathbb{R}^m)^*) \\ \nabla \ell\in\mathbb{R}^m\to(\mathbb{R}^m)^* \\ (\nabla \ell)(f)\in(\mathbb{R}^m)^* \end{gather*}
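The currying in these type signatures can be mirrored directly in code. Here is a toy sketch (my own construction, with the gradient computed by finite differences) where `grad` takes a function, returns something that takes a point, and that in turn returns a linear functional:

```python
import numpy as np

def grad(ell):
    """grad : (R^m -> R) -> (R^m -> (R^m -> R)), curried as in the text."""
    def at_point(f):
        # Finite-difference gradient of ell at the point f in R^m
        eps = 1e-6
        g = np.array([
            (ell(f + eps * e) - ell(f - eps * e)) / (2 * eps)
            for e in np.eye(len(f))
        ])
        def functional(v):
            # The element of (R^m)*: a linear map R^m -> R
            return g @ v
        return functional
    return at_point

# Example: ell(f) = sum_i f_i^2, whose gradient at f is 2f
ell = lambda f: np.sum(f ** 2)
f0 = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 0.0, -1.0])

# (grad ell)(f0) is a linear functional; applying it to v gives 2 f0 . v
print(grad(ell)(f0)(v))
```

Each pair of parentheses in `grad(ell)(f0)(v)` is exactly one arrow in the type signature.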
So what is this $V^*$? Well, recall that the defining characteristic of a Hilbert space is an inner product. This inner product gives you a way to turn elements of $V$ into elements of $V^*$: if $v\in V$, then $(w\mapsto\langle v,w\rangle)\in V^*$. (In coordinates, this is the transpose.) In fact, the correspondence goes both ways: every element of $V^*$ is of this form. So it's common to identify $V$ with $V^*$ (although you cannot do this if you have more than one Hilbert space in play at once). Once you identify $V$ with $V^*$, your inner product can take two elements of $V$…or an element of $V$ and an element of $V^*$…or an element of $V^*$ and an element of $V$…or two elements of $V^*$. In the $V^*\times V$ case, you can unravel what I said above to get $$\langle\phi,v\rangle=\phi(v)$$
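In coordinates, this identification is a one-liner; a tiny sketch with arbitrary vectors in $\mathbb{R}^3$:

```python
import numpy as np

v = np.array([1.0, -2.0, 3.0])   # an element of V = R^3
w = np.array([0.5, 4.0, 1.0])    # another element of V

# The inner product turns v into a linear functional phi = <v, .> in V*;
# in coordinates this is just multiplication by the transpose (row vector) v^T.
phi = lambda u: v @ u

# The V* x V pairing unravels to plain application: <phi, w> = phi(w) = <v, w>
print(phi(w), v @ w)
```

Both printed numbers agree, which is exactly the statement $\langle\phi,v\rangle=\phi(v)$ after identifying $\phi$ with its representing vector.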
Based on (1), what you want to show is actually $$(\nabla \ell)(f)(g)=\sum_j{\frac{\partial \ell(f)}{\partial f(x_j)}g(x_j)}$$ But nobody writes that, because we're allergic to parentheses, so we write the equivalent $$\langle(\nabla \ell)(f),g\rangle=\sum_j{\frac{\partial \ell(f)}{\partial f(x_j)}g(x_j)}$$ instead. And since the inner product on $\mathbb{R}^m$ here is the standard dot product of the evaluation vectors, the two readings in your question coincide.
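As a final sanity check (with made-up data and the quadratic error from the question), this pairing agrees numerically with the Gateaux derivative $\frac{d}{dt}\,\ell(f+tg)\big|_{t=0}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
y = rng.normal(size=m)        # the fixed y_i
f = rng.normal(size=m)        # f, represented by its values f(x_i)
g = rng.normal(size=m)        # the direction g, likewise by its values g(x_i)

# Quadratic error: ell(f) = sum_i (y_i - f(x_i))^2
ell = lambda f: np.sum((y - f) ** 2)

# Coordinate gradient: d ell / d f(x_i) = -2 (y_i - f(x_i))
grad = -2 * (y - f)

# <(grad ell)(f), g>: a plain dot product of the two evaluation vectors
pairing = grad @ g

# Gateaux derivative of ell at f along g, by a symmetric finite difference
t = 1e-6
gateaux = (ell(f + t * g) - ell(f - t * g)) / (2 * t)

print(pairing, gateaux)  # the two agree up to roundoff
```

Because $\ell$ is quadratic here, the symmetric difference quotient is exact up to floating-point error, so the match is essentially to machine precision.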