I am reading my draft and David Haussler's document Convolution Kernels on Discrete Structures, UCSC-CRL-99-10.
My draft

and the other document


The terminology seems to differ. The other document is from a Computer Science department, so I cannot trust it 100% at the moment. They call one instance of $\Phi$ a kernel, $K$. I call the kernel $\sigma$. It seems that one can also take a series of "kernels" and call the result a single kernel.
The term "convolution kernel" caught my eye. I think the kernel of the Wigner-Ville distribution is one of them. Is it?
Why are they taking a series of "kernels"? My interpretation may be wrong. What is the difference between the two kernel definitions?
Your first document considers the specific Wigner-Ville "kernel", described (at least heuristically) by a formula, for which convergence is an issue. Any (bilinear...) map on pairs of functions (heuristically) written in such a fashion (with something else in place of the $x(t-\tau)\,x(t+\tau)$ etc. in the W-V kernel) is usually called a "kernel map", and/or the thing replacing the W-V's $x(t-\tau)\,x(t+\tau)$ is "the kernel".
Your first document also mentions the space of tempered distributions "in two variables", $S'(\mathbb R^{2n})$, as a/the space of "kernels". And, indeed, Schwartz' Kernel Theorem shows that any continuous linear map $T:S(\mathbb R^n)\to S'(\mathbb R^n)$ is given by $T(f)(g)=K(f\otimes g)$ for some $K\in S'(\mathbb R^{2n})$. And this gives a way to tweak the W-V kernel, etc. Thinking of tempered distributions as generalized functions, we might write $K(x,y)$ to specify $K$ itself, and imagine integration against $f(x)\,g(y)$.
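As a finite-dimensional analogue of the pairing $T(f)(g)=K(f\otimes g)$, here is a minimal NumPy sketch. It assumes functions on a grid of `n` points (so a "kernel" is just an `n` by `n` array and no convergence issues arise); the names `K`, `f`, `g`, `pairing_via_tensor` and `pairing_via_operator` are illustrative and not from either document.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Assumption: a discrete "kernel" K[x, y] and two "test functions" f, g on n points.
K = rng.standard_normal((n, n))
f = rng.standard_normal(n)
g = rng.standard_normal(n)

# K applied to the tensor product: sum over x, y of K[x, y] * f[x] * g[y].
pairing_via_tensor = np.sum(K * np.outer(f, g))

# The same number, read as "the functional T(f) evaluated at g",
# where T(f)[y] = sum over x of K[x, y] * f[x].
Tf = K.T @ f
pairing_via_operator = Tf @ g

print(np.isclose(pairing_via_tensor, pairing_via_operator))
```

In this discrete setting every linear map arises from some array `K`, which is the trivial version of what the kernel theorem asserts for tempered distributions.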
In general, a "convolution kernel" $K(x,y)$ is a two-variable function of the special form $K(x,y)=F(x-y)$. That is, some kernels can be made from one-variable functions by convolution, but not all kernels arise this way.
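To make the special form $K(x,y)=F(x-y)$ concrete, here is a small discrete sketch. It assumes a periodic grid of `n` points (so $x-y$ is taken mod `n`), which is a simplification of mine; the helper names `convolution_kernel_matrix` and `depends_only_on_difference` are hypothetical.

```python
import numpy as np

def convolution_kernel_matrix(F):
    """Build K[i, j] = F[(i - j) mod n] from a one-variable array F."""
    F = np.asarray(F)
    n = len(F)
    diff = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return F[diff]

def depends_only_on_difference(K):
    """Check whether K[i, j] is a function of (i - j) mod n alone."""
    n = K.shape[0]
    i = np.arange(n)
    return all(
        np.allclose(K[i, (i - d) % n], K[0, (-d) % n]) for d in range(n)
    )

rng = np.random.default_rng(0)
F = rng.standard_normal(8)
f = rng.standard_normal(8)

K = convolution_kernel_matrix(F)

# Applying the kernel matrix agrees with circular convolution of F and f,
# computed independently through the FFT.
via_kernel = K @ f
via_fft = np.fft.ifft(np.fft.fft(F) * np.fft.fft(f)).real
kernel_matches_convolution = np.allclose(via_kernel, via_fft)

# A kernel that is not of convolution type: K[i, j] = (i + 1) * (j + 1).
generic = np.outer(np.arange(1, 9), np.arange(1, 9)).astype(float)

print(kernel_matches_convolution)
print(depends_only_on_difference(K), depends_only_on_difference(generic))
```

The second check illustrates the "not all" part: a product kernel such as `generic` is a perfectly good two-variable function but is not constant along the diagonals $x-y=\text{const}$.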
Your "other source" is addressing just a very special class of "kernels", made in a special way, and is a bit sloppy about notation and terminology. But it is still a special case of the far more general idea that a kernel is a distribution of some kind:
The notation and imprecise language easily give the wrong impression about what the $\Phi_n$'s are, in the notation of that "other" doc. It would be more accurate to take the $\Phi_n$'s to be an orthonormal basis for some Hilbert space, and write $\Phi(x,y)=\sum_n \Phi_n(x)\Phi_n(y)$. Yes, it is possible to interpret the latter expression as some sort of inner product, in the $n$ variable, but that is very misleading, and irrelevant. Then the corresponding operator is $Tf(y)=\sum_n \langle f,\Phi_n\rangle\cdot \Phi_n(y)$.
One of the features of this kind of operator is the positivity $\langle Tf,f\rangle\ge 0$ for all $f$. Also, it is symmetric in the sense that $\Phi(y,x)=\Phi(x,y)$, ... though with pointwise convergence issues we might prefer to say $\langle Tf,g\rangle=\langle f,Tg\rangle$ for all $f,g$ (keeping in mind that the scalars are real, not complex... )
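A finite-dimensional sanity check of these two properties, as a rough sketch: it assumes real scalars, a space of dimension `dim`, and only `m` orthonormal vectors $\Phi_n$ (obtained here from a QR factorization of a random matrix, purely for illustration). All variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, m = 6, 3

# Assumption: the Phi_n are the first m columns of an orthogonal matrix Q.
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
phis = Q[:, :m]

# Discrete kernel Phi[x, y] = sum over n of Phi_n[x] * Phi_n[y].
Phi = phis @ phis.T

f = rng.standard_normal(dim)
g = rng.standard_normal(dim)
Tf = Phi @ f  # Tf = sum over n of <f, Phi_n> * Phi_n
Tg = Phi @ g

# Positivity: <Tf, f> equals the sum of squares of the coefficients <f, Phi_n>.
quadratic_form = Tf @ f
sum_of_squares = np.sum((phis.T @ f) ** 2)

print(np.isclose(quadratic_form, sum_of_squares))  # <Tf, f> = sum <f, Phi_n>^2
print(np.allclose(Phi, Phi.T))                     # Phi(y, x) = Phi(x, y)
print(np.isclose(Tf @ g, f @ Tg))                  # <Tf, g> = <f, Tg>
```

The identity $\langle Tf,f\rangle=\sum_n \langle f,\Phi_n\rangle^2$ is what makes the positivity evident, and it does not depend on the particular choice of the $\Phi_n$.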