Let $X, Y$ be two random variables, with $X$ taking values in $\Bbb R^n$ and $Y$ taking values in $\Bbb R$.
Then we can look at the function $h: \Bbb R^n \to \Bbb R$ given by $$\beta \mapsto \Bbb E[(Y-X^T\beta)^2]$$ It is claimed that the gradient of $h$ is given by $$\nabla h = \Bbb E[2X(X^T\beta-Y)]$$
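Before worrying about justification, one can sanity-check the claimed gradient numerically: replace the expectation by an empirical average over one fixed sample and compare the formula against central finite differences of $h$. (The sampling setup and all names below are illustrative, not part of the question.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 200_000  # dimension and sample size (illustrative choices)

# Fixed sample standing in for the joint distribution of (X, Y)
X = rng.normal(size=(m, n))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=m)

beta = np.array([0.3, 0.1, -0.7])

def h(b):
    """Empirical version of E[(Y - X^T beta)^2]."""
    r = Y - X @ b
    return np.mean(r * r)

# Claimed gradient E[2 X (X^T beta - Y)], again as an empirical average
grad_analytic = 2 * np.mean(X * (X @ beta - Y)[:, None], axis=0)

# Central finite differences of h at beta, coordinate by coordinate
eps = 1e-6
grad_fd = np.array([
    (h(beta + eps * e) - h(beta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.max(np.abs(grad_analytic - grad_fd)))  # agrees up to floating-point error
```

Since both quantities are computed from the same fixed sample, they agree up to floating-point error (for this quadratic $h$, central differences are exact in exact arithmetic).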
This seems like a special case of the identity
$$\nabla \Bbb E[f]=\Bbb E [\nabla f]$$
where the expectation is taken over the joint distribution of some random variables.
Formally, we want the following: suppose $X_1,...,X_m$ are random variables taking values in sets $A_1,...,A_m$ with some given joint distribution. Then for every function $f: \Bbb R^n \times \prod_i A_i \to \Bbb R$ and every $\beta \in \Bbb R^n$ we can form the random variable $f(\beta, X_1,...,X_m)$ and take its expectation. Letting $\beta$ vary gives a function $\Bbb R^n \to \Bbb R$. We claim that its gradient equals the expectation of the random vector obtained by fixing the values of $X_1,...,X_m$ and taking the gradient of the resulting function $\Bbb R^n \to \Bbb R$ (this is a random variable taking values in $\Bbb R^n$, so its expectation makes sense).
The following is not the most general version, but it is convenient for the proof. It basically says that you may differentiate under the expectation.
Consider $f:\Omega \times U \to \mathbb R$, where $(\Omega, \mathcal A, P)$ is some probability space (or a $\sigma$-finite measure space, if you like) and $U\subseteq \mathbb R^n$ is open, convex, and contains $0$ (for differentiability at $0$ one may always shrink $U$ to a ball around $0$, so convexity costs nothing).
Assume for (almost) every $\omega\in\Omega$ the function $z \mapsto f(\omega, z)$ is differentiable and define $g(\omega, z) = D_{z} f(\omega, z)$.
Assume there is a $P$-integrable function $S:\Omega \to \mathbb R$ with $\|g(\;.\;, z)\| \le S$ (almost) everywhere for every $z\in U$, and define $F(z) = \int f(\;.\;, z) \, dP$ and $G(z) = \int g(\;.\;, z) \, dP$.
Assume $G$ is continuous at $0$.
Then, $F$ is differentiable at $0$ with $$ D F(0) = G(0). $$
Proof:
For (almost) every $\omega\in\Omega$ we have $\|g(\omega,\;.\;)\| \le S(\omega) < \infty$. That is, $f(\omega, \;.\;)$ is Lipschitz continuous and the fundamental theorem of calculus applies along the segment from $0$ to $z$. Thus, for $z\in U$ we have \begin{align*} F(z) - F(0) - G(0)z &= \int f(\omega, z) - f(\omega,0) - g(\omega,0)z \; dP(\omega) \\ &= \int \int_{0}^1 g(\omega, tz)z - g(\omega, 0)z \; dt\, dP(\omega) \\ &=\biggl(\underbrace{\int \int_{0}^1 g(\omega, tz) - g(\omega, 0) \; dt \, dP(\omega)}_{R(z)} \biggr) z. \end{align*} Since $$ \int\int_0^1 \| g(\omega, tz) - g(\omega, 0) \| \; dt \, dP(\omega) \le \int\int_0^1 2 S(\omega) \; dt \, dP(\omega) = 2 \int S \, dP < \infty,$$ Fubini's theorem applies, and for $z\to 0$ we have $$ R(z) = \int_0^1 \int g(\omega, tz) - g(\omega, 0)\; dP(\omega) \, dt = \int_0^1 G(tz) - G(0) \, dt \to 0, $$ using continuity of $G$ at $0$ pointwise in $t$ together with dominated convergence (the integrand is bounded by $2\int S\,dP$). That is, $F$ is differentiable at $0$ with $$ DF(0) = G(0). $$
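As a numerical illustration of the conclusion (not of the proof itself), here is a scalar toy case with $f(\omega, z) = \sin(\omega + z)$ and $\omega$ uniform on $[0,1]$, where $F$ and $G$ are approximated by averages over one fixed Monte Carlo sample; the whole setup is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
omega = rng.uniform(0.0, 1.0, size=500_000)  # fixed sample for all integrals

F = lambda z: np.mean(np.sin(omega + z))  # F(z) = ∫ f(ω, z) dP(ω)
G = lambda z: np.mean(np.cos(omega + z))  # G(z) = ∫ D_z f(ω, z) dP(ω)

# The remainder F(z) - F(0) - G(0) z should vanish faster than |z|
for z in [1e-1, 1e-2, 1e-3]:
    rem = F(z) - F(0.0) - G(0.0) * z
    print(z, rem / z)  # the ratio shrinks as z shrinks
```

The printed ratio is exactly the $R(z)$ of the proof (up to the Monte Carlo approximation of the integrals), and it visibly tends to $0$, i.e. $F$ is differentiable at $0$ with derivative $G(0)$.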
Notes:
The assumptions are obviously satisfied in your case.
We can relax assumptions 1 and 2 to: assume for almost every $\omega\in\Omega$ the function $f(\omega,\;.\;)$ is locally absolutely continuous with $$ \sup_{z\in U} \int \| D_z f(\;.\;, z) \| \, dP < \infty. $$ However, one needs to be careful about the domain of $D_z f$.
We can replace the Fubini-type argument with a dominated-convergence-type argument under a slightly stronger assumption 3. This would allow arbitrary measures instead of only $\sigma$-finite ones.
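To spell out the first note: the theorem covers the question's setting whenever $X$ and $Y$ have finite second moments (a sufficient condition assumed here for concreteness, not stated in the question). A sketch:

```latex
% f(\omega, \beta) = \bigl(Y(\omega) - X(\omega)^T \beta\bigr)^2, so pointwise in \omega
\[
  D_\beta f(\omega, \beta) = 2\, X(\omega)\bigl(X(\omega)^T \beta - Y(\omega)\bigr).
\]
% On a bounded neighborhood U of any fixed \beta_0, with C = \sup_{\beta \in U} \|\beta\|,
\[
  \bigl\| D_\beta f(\omega, \beta) \bigr\|
    \le 2\, \|X(\omega)\| \bigl( C \,\|X(\omega)\| + |Y(\omega)| \bigr) =: S(\omega),
\]
% and S is P-integrable when E[\|X\|^2] < \infty and E[Y^2] < \infty
% (Cauchy--Schwarz handles the cross term E[\|X\|\,|Y|]). The same bound makes
% G continuous by dominated convergence, so the theorem gives
\[
  \nabla h(\beta) = \mathbb E\bigl[ 2 X (X^T \beta - Y) \bigr].
\]
```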