In his online lectures on Computational Science, Prof. Gilbert Strang often interprets divergence as the "transpose" of the gradient, for example here (at 32:30); however, he does not explain the reason.
How is it that the divergence can be interpreted as the transpose of the gradient?

A "dual pair" in functional analysis consists of a topological vector space $E$ and its dual space $E'$ of continuous linear functionals, or some subspace of it.
That is, for real vector spaces, every pair of elements $e \in E$ and $e' \in E'$ gives a real number $$ \langle e, e' \rangle \in \mathbb{R}. $$ Example: Let $E$ be a real Hilbert space. Then $E'$ can be identified with $E$ (Riesz representation theorem), and the dual pairing is given by the scalar product.
In the case at hand we have two function spaces and the dual pairing is defined to be $$ \langle u, v \rangle = \int_{\Omega} u(x, y) \, v(x, y) \, dx \, dy. $$ When you have some operator $$ T: E \to E $$ it is often possible to define the "transposed operator" $T'$ to be the operator $$ T': E' \to E' $$ by the requirement that $$ \langle T e, e' \rangle = \langle e, T' e' \rangle $$ for all $e \in E$ and $e' \in E'$. In the context of Hilbert spaces, it is more common to talk about "adjoint operators". The name "transpose" is motivated by the fact that for linear operators on finite-dimensional vector spaces, the transpose is given by the transposed matrix (the conjugate transpose, for the adjoint over a complex ground field) of the matrix that represents $T$ with respect to a fixed basis.
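In the case at hand, when we write down $$ \int_{\Omega} (-\operatorname{div} \operatorname{grad} u(x, y)) \, v(x, y) \, dx \, dy $$ you'll see that this is the same as $$ \int_{\Omega} (\operatorname{grad} u(x, y)) \cdot (\operatorname{grad} v(x, y)) \, dx \, dy $$ by integration by parts, if the boundary term $\int_{\partial\Omega} v \, (\operatorname{grad} u \cdot n) \, ds$ (with $n$ the outward unit normal) is zero, for example because $v$ vanishes on $\partial\Omega$. The $\cdot$ denotes the canonical scalar product of vectors in $\mathbb{R}^2$. The same computation works with any vector field $e$ in place of $\operatorname{grad} u$. So, if the boundary terms are zero, we have
$$ \langle -\operatorname{div} e, e' \rangle = \langle e, \operatorname{grad} e' \rangle $$ where, strictly speaking, the dual pairing on each side is different, because the first is a dual pairing of functions with values in $\mathbb{R}$, while the second is for functions with values in $\mathbb{R}^2$. But neglecting this technical detail, the operator $\operatorname{grad}$ is in this sense the transposed operator of $-\operatorname{div}$. Equivalently, $\operatorname{div}$ is the transpose of $-\operatorname{grad}$, so the divergence is the transpose of the gradient up to a sign.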
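The identity can also be checked numerically. The following is only a minimal sketch, under assumptions that are not part of the argument above: the domain is the unit square, the scalar function lives on an $n \times n$ grid of interior nodes and is extended by zero (so the boundary terms vanish), and the vector field lives on a staggered grid. The helper names `grad` and `div` are hypothetical finite-difference stand-ins, not library functions.

```python
import numpy as np

def grad(u, h):
    """Forward differences of a scalar grid function u (n x n interior values).

    u is extended by zero outside the grid, which mimics "boundary terms are zero".
    Returns x-differences of shape (n+1, n) and y-differences of shape (n, n+1).
    """
    up = np.pad(u, 1)  # zero padding = homogeneous boundary values
    gx = (up[1:, 1:-1] - up[:-1, 1:-1]) / h
    gy = (up[1:-1, 1:] - up[1:-1, :-1]) / h
    return gx, gy

def div(wx, wy, h):
    """Differences of a staggered vector field (wx, wy), evaluated at the n x n nodes."""
    return (wx[1:, :] - wx[:-1, :]) / h + (wy[:, 1:] - wy[:, :-1]) / h

n = 8
h = 1.0 / (n + 1)
rng = np.random.default_rng(0)
u = rng.standard_normal((n, n))        # scalar function e'
wx = rng.standard_normal((n + 1, n))   # x-component of the vector field e
wy = rng.standard_normal((n, n + 1))   # y-component of the vector field e

gx, gy = grad(u, h)
lhs = h**2 * np.sum(-div(wx, wy, h) * u)            # discrete <-div e, e'>
rhs = h**2 * (np.sum(wx * gx) + np.sum(wy * gy))    # discrete <e, grad e'>
print(np.isclose(lhs, rhs))
```

In this discretization the two pairings agree up to rounding error for any choice of `u`, `wx`, `wy`, because the matrix of `div` is exactly minus the transposed matrix of `grad`. This is the finite-dimensional version of the statement above.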