I have seen several proofs of the Schwartz Kernel Theorem, using different techniques. Some (such as Melrose's proof in his notes on microlocal analysis) use the representations of $\mathcal{S}(\mathbb{R}^n)$ and $\mathcal{S}'(\mathbb{R}^n)$ in terms of weighted Sobolev spaces, others (such as the proof in Duistermaat and Kolk) use the Fourier transform, and others (such as the one in Friedlander and Joshi) use Fourier series.
I can follow these proofs, but I feel I don't really understand them, in that I don't understand what fundamental properties of the space of distributions make them work.
I see that there are similarities: for example, the last two approaches use some sort of representation of test functions on $X\times Y$ as sums of tensor products of test functions on $X$ and $Y$.
I found this remark in an old paper of Ehrenpreis (On the Theory of Kernels of Schwartz, Proceedings of the American Mathematical Society, Vol. 7, No. 4 (Aug., 1956), pp. 713-718):
Lemma 1 is the only part of the proof of Theorem 1 [the kernel theorem] that uses special properties of the space $\mathcal{D}$ and, in fact, the analog of Theorem 1 [the kernel theorem] holds for (essentially) all function spaces for which an analog of Lemma 1 can be found.
Lemma 1 is the following
Let $B$ be a bounded set in $\mathcal{D}(\mathbb{R}^n\times\mathbb{R}^n)$. Then we can find a bounded set $B'\subset\mathcal{D}(\mathbb{R}^n)$ and a $b>0$ so that every $f\in B$ can be written in the form $\sum_i \lambda_ig_i\otimes h_i$ where $\sum_i|\lambda_i|<b$, and $g_i, h_i\in B'$, and where the series converges in $\mathcal{D}(\mathbb{R}^n\times\mathbb{R}^n)$.
The remark would suggest that the key point really is being able to decompose test functions on $X\times Y$ into sums of tensor products of test functions on $X$ and $Y$, but I still don't see why this should be the case.
I also read that the theory of Nuclear Spaces proves an abstract kernel theorem, generalising the usual statement for distributions. I assume this implies being able to extract the fundamental properties that make the kernel theorem work, but I found no short and essential exposition of the theory, or one which does not require extensive prerequisites.
So, my questions are:
- How do people who understand the kernel theorem think about its proof?
- What are the fundamental ingredients that make it work?
- I understand why it is important, but why is it so surprising that every continuous linear map $\mathcal{D}\rightarrow\mathcal{D}'$ is given by a kernel?
It depends on what you call the Kernel Theorem. The full version is that the map $$ \mathcal{D}'(\mathbb{R}^{m+n}) \rightarrow {\rm Hom}(\mathcal{D}(\mathbb{R}^m),\mathcal{D}'(\mathbb{R}^n)) $$ $$ T\mapsto(f\mapsto (g\mapsto T(f\otimes g)) ) $$ is a topological vector space isomorphism. Here $f(x)$ is a test function in $\mathcal{D}(\mathbb{R}^m)$, $g(y)$ is a test function in $\mathcal{D}(\mathbb{R}^n)$ and $f\otimes g$ denotes the test function in $\mathcal{D}(\mathbb{R}^{m+n})$ given by $(x,y)\mapsto f(x)g(y)$. The spaces of distributions $\mathcal{D}'(\mathbb{R}^{m+n})$ and $\mathcal{D}'(\mathbb{R}^n)$ must be given the proper topology, i.e., the strong topology and not the weak-star. The space ${\rm Hom}(\mathcal{D}(\mathbb{R}^m),\mathcal{D}'(\mathbb{R}^n))$ is the space of continuous (in the usual point set topology sense, not that of sequential continuity) linear maps from $\mathcal{D}(\mathbb{R}^m)$ to $\mathcal{D}'(\mathbb{R}^n)$. The topology on this $\rm Hom$ is the one defined by the seminorms $$ ||\phi||=\sup_{f\in A}\rho(\phi(f)) $$ where $A$ ranges over bounded sets in $\mathcal{D}(\mathbb{R}^m)$ and $\rho$ over continuous seminorms of $\mathcal{D}'(\mathbb{R}^n)$. Equivalently, you can take the seminorms $$ ||\phi||=\sup_{f\in A, g\in B}|\phi(f)(g)| $$ where $A$ ranges over bounded sets in $\mathcal{D}(\mathbb{R}^m)$ and $B$ ranges over bounded sets in $\mathcal{D}(\mathbb{R}^n)$.
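As a concrete instance of this map, take $m=n$ and consider integration along the diagonal. The symbol $\Delta$ is notation introduced here only for illustration: $$ \Delta(\varphi)=\int_{\mathbb{R}^n}\varphi(x,x)\,dx\ ,\qquad \varphi\in\mathcal{D}(\mathbb{R}^{2n})\ . $$ Then $$ \Delta(f\otimes g)=\int_{\mathbb{R}^n}f(x)g(x)\,dx\ , $$ so the image of $\Delta$ is the map $f\mapsto(g\mapsto\int fg)$, i.e., the canonical inclusion of $\mathcal{D}(\mathbb{R}^n)$ into $\mathcal{D}'(\mathbb{R}^n)$. Its kernel is formally $\delta(x-y)$, which is not a function: this is why one needs distributions on $\mathbb{R}^{m+n}$, and not just functions, on the left-hand side.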
To truly understand the theorem, you need to first consider the simpler case with $\mathcal{S},\mathcal{S}'$ instead of $\mathcal{D},\mathcal{D}'$. This in turn requires the understanding of the discrete toy model given by spaces of sequences.
Let $\mathbb{N}=\{0,1,2,\ldots\}$. We denote by $s(\mathbb{N}^m)$ the space of (multi)sequences $u=(u_{\alpha})$ indexed by multiindices $\alpha\in\mathbb{N}^m$ for which the following quantities are finite $$ ||u||_k=\sup_\alpha \langle\alpha\rangle^k|u_{\alpha}| $$ for all $k\in\mathbb{N}$. Here I used the Japanese bracket $\langle\alpha\rangle=\sqrt{1+\alpha_1^2+\cdots+\alpha_m^2}$. We use the above seminorms to define the topology of this space of rapidly decaying multisequences.
Then we define the space $s'(\mathbb{N}^m)$ of multisequences of moderate growth, i.e., multisequences $v=(v_{\alpha})_{\alpha\in\mathbb{N}^m}$ for which there exist $k\in\mathbb{N}$ and $C\ge 0$ such that for all $\alpha$ $$ |v_{\alpha}|\le C\langle\alpha\rangle^k\ . $$ It can be identified with the topological dual of $s(\mathbb{N}^m)$ via the obvious pairing $$ (v,u)\mapsto \sum_{\alpha\in\mathbb{N}^m}v_{\alpha} u_{\alpha}\ . $$ The correct (strong) topology on this topological dual becomes, at the level of its concrete representation $s'(\mathbb{N}^m)$, the topology generated by the seminorms $$ ||v||_u=\sup_{\alpha\in\mathbb{N}^m} u_{\alpha} |v_{\alpha}| $$ indexed by elements $u$ of $s(\mathbb{N}^m)$ with nonnegative entries.
One can now state the toy kernel theorem in exactly the same way as before. Namely, the map $$ s'(\mathbb{N}^{m+n}) \rightarrow {\rm Hom}(s(\mathbb{N}^m),s'(\mathbb{N}^n)) $$ $$ v\mapsto\Bigl(u\mapsto \Bigl(\sum_{\alpha\in\mathbb{N}^m} v_{\alpha,\beta}u_\alpha\Bigr)_{\beta\in\mathbb{N}^n} \Bigr) $$ is a topological vector space isomorphism. The proof is a bit long but elementary. If you work it out by yourself, you will have understood the kernel theorem. Indeed, using Hermite functions and the resulting isomorphisms with multisequence spaces, the above toy model kernel theorem implies the one for $\mathcal{S},\mathcal{S}'$.
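To get a feel for the toy map, here is a minimal numerical sketch for $m=n=1$, truncated to finitely many indices. The truncation size `N`, the sample sequences `u`, `t`, `v`, and the helper name `apply_kernel` are all illustrative assumptions; a finite truncation can only suggest the estimates and proves nothing about the infinite sums.

```python
import numpy as np


def apply_kernel(v, u):
    """Truncated toy kernel map: returns (sum_alpha v[alpha, beta] * u[alpha])_beta."""
    return np.einsum("ab,a->b", v, u)


N = 50                      # truncation size (arbitrary choice for illustration)
idx = np.arange(N)          # indices 0, 1, ..., N-1 used for both alpha and beta

# Two rapidly decaying sequences, standing in for elements of s(N).
u = np.exp(-idx)
t = np.exp(-2.0 * idx)

# A kernel of moderate growth, standing in for an element of s'(N^2):
# v[alpha, beta] = 1 + alpha^2 + beta^2, the square of the Japanese bracket.
v = 1.0 + idx[:, None] ** 2 + idx[None, :] ** 2

# Image of u under the operator with kernel v.
w = apply_kernel(v, u)

# Moderate growth of w: here w[beta] = (1 + beta^2) * S0 + S2, so the bound
# |w[beta]| <= C * <beta>^2 holds with C = S0 + S2.
S0 = u.sum()
S2 = (idx ** 2 * u).sum()
C = S0 + S2

# Pairing w with t recovers the bilinear form v(u tensor t).
pairing = np.dot(w, t)
bilinear = np.einsum("ab,a,b->", v, u, t)

# The kernel delta_{alpha,beta} gives the inclusion of s into s'.
identity_image = apply_kernel(np.eye(N), u)
```

The last line is the discrete counterpart of the kernel $\delta(x-y)$: the bounded, hence moderately growing, multisequence $\delta_{\alpha,\beta}$ represents the inclusion of $s(\mathbb{N})$ into $s'(\mathbb{N})$.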
The key facts needed for the toy theorem are:
If $\mathcal{S},\mathcal{S}'$ is not enough for you and you insist on $\mathcal{D},\mathcal{D}'$, you can also do it with multimatrices (instead of multisequences), but that's quite a bit more work since you will need the results of this article by Bargetz.