I am trying to wrap my head around some concepts of Reproducing Kernel Hilbert Spaces (RKHS) without having a formal background in functional analysis. Since I am trying to form an intuition about what this space is and how it does what it does, I would appreciate it if you could double-check my reasoning.
An RKHS associated with a kernel $k(x,x')$ (evaluated at $x$, centered on $x'$), where $x,x' \in \mathcal{X}$, contains functions of the form
$$f(x)=\sum_{i=1}^{m}\alpha_ik(x,x_i)$$
where $\alpha_i \in \mathbb{R}$ are some coefficients and $m \in \mathbb{N}$ is the number of terms the index $i$ runs over. Now RKHSs have inner products. The rules for inner products state that if I have two vectors $\textbf{a}=[a_1,a_2,...,a_m]$ and $\textbf{b}=[b_1,b_2,...,b_m]$ in some $m$-dimensional vector space, then their inner product $\langle\textbf{a},\textbf{b}\rangle$ is:
$$\langle\textbf{a},\textbf{b}\rangle=\sum_{i=1}^{m}a_ib_i$$
One can see the similarity between the right-hand side of the first and second equation. So if we define $f(\cdot)=[\alpha_1,\alpha_2,...,\alpha_m]$ and $k(x,\cdot)=[k(x,x_1),k(x,x_2),...,k(x,x_m)]$ we could express $f(x)$ as an inner product in a RKHS $\mathcal{H}$:
$$f(x)=\langle f(\cdot),k(x,\cdot)\rangle_\mathcal{H}$$
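To make sure I am reading this the way I think I am, here is a tiny numerical sketch I put together (the Gaussian kernel and the specific numbers are just my own arbitrary choices for illustration):

```python
import numpy as np

# A kernel I picked purely for illustration: the Gaussian/RBF kernel.
def k(x, xp, gamma=1.0):
    return np.exp(-gamma * (x - xp) ** 2)

# Centers x_1, ..., x_m and coefficients alpha_1, ..., alpha_m.
centers = np.array([0.0, 1.0, 2.5])
alpha   = np.array([0.5, -1.0, 2.0])

x = 1.7  # evaluation point

# Equation 1: f(x) = sum_i alpha_i k(x, x_i)
f_x_sum = sum(a * k(x, xi) for a, xi in zip(alpha, centers))

# The same number, read as a dot product between the coefficient vector
# [alpha_1, ..., alpha_m] and the vector [k(x, x_1), ..., k(x, x_m)].
f_x_dot = np.dot(alpha, k(x, centers))

print(f_x_sum, f_x_dot)  # identical up to floating-point error
```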
That's the so-called reproducing property, if I understand correctly. Am I correct so far? If yes, I have a few questions:
If the RKHS is a space, it should have a dimensionality (in our case $m$) and orthonormal bases. If we take a parameter space, for example, each of its dimensions has a fixed interpretation (say, a 2-D space with dimensions $x=\text{weight}$ and $y=\text{age}$ allows for an inner product of two vectors, but both vectors will contain elements of the type $(\text{weight},\text{age})$). Now I can see how the expression in Equation 1 can be interpreted as an inner product, but the two vectors (albeit of equal length) contain different elements: $f(\cdot)$ is a vector of scalar coefficients, and $k(x,\cdot)$ is a vector of functions. They do not seem to share their bases in the same sense that the weight-age example above would. Does this mean that the RKHS $\mathcal{H}$ is sort of a general-purpose space with no fixed definition as to what its dimensions represent, or is there a different interpretation I am missing?
My second question relates to the dimensionality $m$ (which I adopted from the Wikipedia article). The way I understand it, this dimensionality $m$ relates to the number of elements in the set $\mathcal{X}$. Strictly speaking, a function $f(x)$ defined according to Equation 1 could theoretically contain kernels centered on every element $x' \in \mathcal{X}$, in which case the dimensionality of $\mathcal{H}$ would be as large as the set itself ($m=|\mathcal{X}|$) and possibly infinite if the set $\mathcal{X}$ is infinitely large (e.g., $\mathcal{X}$ is the continuous real line). Wouldn't specifying $|\mathcal{X}|$ as the upper limit of the sum be more general than $m$? If we are only interested in functions $f(x)$ that are based on a subset of $\mathcal{X}$, we could still sum over all theoretically possible dimensions and throw out the irrelevant kernels by setting their corresponding entries in $f(\cdot)$ to zero.
Is this right or am I missing something?
The elements of the RKHS $H_k$ associated with a given kernel are actually limits of the functions you mentioned. Let us recall the end goal: you want to start from a p.d. symmetric function $k$ and arrive at a Hilbert space of functions $H_k$ where $k$ reproduces the elements of $H_k$ in the sense that $f(x) = \langle f, k(x, \cdot) \rangle$ for every $f$. Let's unpack the last statement. You want to find a vector space of functions (please make sure you understand what it means to treat functions as vectors), then define an inner product $\langle \cdot, \cdot \rangle$ on it, and then verify that each evaluation functional is continuous (or bounded; the conditions are equivalent). Finally, we have to check the reproducing property.
Let's work a bit on the definition of an inner product, as I sense some confusion right around the corner. Inner products are abstract entities. That is, let $V$ be a real vector space; then any function $f: V \times V \to \mathbb{R}$ satisfying the following properties is called an inner product: symmetry, $f(u,v)=f(v,u)$; linearity in the first argument, $f(au+bw,v)=af(u,v)+bf(w,v)$; and positive-definiteness, $f(v,v)\geq 0$ with equality only when $v=0$.
I used $f$ here just to make it clear to you that there is nothing special in our choice of $\langle \cdot, \cdot \rangle$; it's just notation.
When $V = \mathbb{R}^n$, the dot product ($\langle x, y \rangle = \sum_{i=1}^n x_iy_i$) is the most commonly used inner product. However, there are other inner products!
For example, you can put weights on each coordinate and still end up with an inner product: $\langle x, y \rangle = \sum_{i=1}^n c_ix_iy_i$ for fixed positive $c_1, c_2, ..., c_n$.
In other finite-dimensional vector spaces, you can get even weirder inner products. One example of this is the inner product space of polynomials of degree at most $n$ (you've probably seen this space in your linear algebra class). Here, the inner product is given by an integral: $\langle f,g \rangle = \int_a^b f(x)g(x)\,dx$.
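Just to drive home that these really are inner products you can compute with, here is a small sketch; the weights, the interval $[0,1]$, and the particular polynomials are arbitrary choices of mine:

```python
import numpy as np

# Weighted dot product on R^3: <x, y> = sum_i c_i x_i y_i with fixed positive c_i.
c = np.array([1.0, 2.0, 0.5])           # arbitrary positive weights
x = np.array([1.0, -1.0, 3.0])
y = np.array([2.0,  0.5, 1.0])
weighted = np.sum(c * x * y)

# Integral inner product on polynomials: <f, g> = integral_a^b f(t) g(t) dt,
# here with a = 0 and b = 1 (my choice).
f = np.poly1d([1.0, 0.0, -1.0])          # f(t) = t^2 - 1
g = np.poly1d([2.0, 1.0])                # g(t) = 2t + 1
antideriv = np.polyint(f * g)            # antiderivative of the product f*g
integral = antideriv(1.0) - antideriv(0.0)

print(weighted, integral)
```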
Having said that, let's proceed.
We can create a vector space, let's call it $H_{k_0}$ ($0$ because it is unfinished; we will have to add some stuff to it), by taking the span of the functions $k_x$ defined by $k_x(y) = k(x,y)$. At this stage, $H_{k_0}$ looks exactly as you've said: all functions are of the form $f(x) = \sum_{i=1}^n \alpha_i k_{x_i}(x)$. But there is a catch: after defining the abstract inner product $\langle \sum_{i=1}^n\alpha_i k_{x_i}, \sum_{j=1}^m \beta_j k_{y_j} \rangle = \sum_{i=1}^n\sum_{j=1}^m \alpha_i\beta_j k(x_i,y_j)$, you may end up with a space that is not complete.
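In concrete terms, this abstract inner product is just a bilinear form computed through the kernel's Gram matrix. A minimal sketch (the Gaussian kernel and the points/coefficients are again arbitrary choices for illustration):

```python
import numpy as np

def k(x, y, gamma=1.0):
    # Gaussian kernel, chosen only for illustration.
    return np.exp(-gamma * (x - y) ** 2)

# f = sum_i alpha_i k_{x_i}   and   g = sum_j beta_j k_{y_j}
x_centers = np.array([0.0, 1.0, 2.0])
alpha     = np.array([1.0, -0.5, 2.0])
y_centers = np.array([0.5, 1.5])
beta      = np.array([0.3, 1.0])

# <f, g> = sum_i sum_j alpha_i beta_j k(x_i, y_j) = alpha^T K beta
K = k(x_centers[:, None], y_centers[None, :])   # cross Gram matrix, K_ij = k(x_i, y_j)
inner_fg = alpha @ K @ beta

# The squared norm of f is the same formula with g = f.
norm_f_sq = alpha @ k(x_centers[:, None], x_centers[None, :]) @ alpha

print(inner_fg, norm_f_sq)
```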
You probably don't have a firm grasp of what completeness means and why it's useful. I'll give just a brief illustrative example so that we can move on. Consider the rational numbers and recall that $\sqrt{2}$ is not a rational number. Back in your school days, you probably played around with finding truncated decimal expansions of irrationals such as $\sqrt{2} = 1.41...$. Let's investigate the following sequence:
$r_1 = 1$, $r_2 = 1.4$, $r_3 = 1.41$, ... (and so on)
If you plot it (do it!), you'll see it clearly converges to $\sqrt{2}$, but we don't have this number in $\mathbb{Q}$. In $\mathbb{Q}$, we have the sad reality that loads of "converging" sequences don't have a limit. This complicates things a lot, and that's why calculus is done with real numbers instead of rational numbers. The formal condition for a vector space to be complete is that every Cauchy sequence has a limit in the space, where a sequence $v_n$ is Cauchy if, for every $\varepsilon > 0$ (as small as it may be), there is a point $N$ in the sequence after which all vectors are $\varepsilon$-close to each other; that is, $\forall \varepsilon > 0 \; \exists N \; \forall n,m \geq N: \|v_n - v_m\| < \varepsilon$. Inner product spaces satisfying this condition are way easier to deal with (they are the infamous Hilbert spaces!).
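If you want to see the $\sqrt{2}$ example in action, here is a tiny sketch (my own illustration) that builds the truncated decimal expansions and shows that consecutive terms get arbitrarily close even though the would-be limit lies outside $\mathbb{Q}$:

```python
from decimal import Decimal, getcontext, ROUND_FLOOR

getcontext().prec = 50
sqrt2 = Decimal(2).sqrt()

def r(n):
    """sqrt(2) truncated to n decimal places: r(0) = 1, r(1) = 1.4, r(2) = 1.41, ..."""
    scale = Decimal(10) ** n
    return (sqrt2 * scale).to_integral_value(rounding=ROUND_FLOOR) / scale

terms = [r(n) for n in range(8)]
print(terms)                                        # 1, 1.4, 1.41, 1.414, ...

# Consecutive terms get epsilon-close (the sequence is Cauchy) ...
print([terms[n + 1] - terms[n] for n in range(7)])  # differences shrink like 10^-n

# ... yet every r(n) is rational, while the limit sqrt(2) is not,
# so this Cauchy sequence has no limit inside Q: Q is not complete.
```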
It is a fact that every inner product space can be completed (much like how $\mathbb{R}$ completes $\mathbb{Q}$). I'll leave the details out, but you can figure them out by searching for the keyword 'inner product space completion'. After you've added the missing vectors, you get the more complicated reality where elements of your vector space (now let's call it $H_k$) look like series of the form:
$$f(x) = \sum_{i=1}^\infty \alpha_ik_{x_i}(x)$$
At this point I'd like to congratulate you for noticing that (1) and (2) are indeed connected. Your intuition works just as well for abstract inner product spaces (the ones we'll be working with). To spell it out, suppose $V$ is an inner product space, $f: V \to \mathbb{R}$ is a linear functional, and $e_1, e_2, ..., e_n$ is an orthonormal basis for $V$. Then, writing $x = \sum_{i=1}^n \alpha_ie_i$, we get $f(x) = f(\sum_{i=1}^n \alpha_ie_i) = \sum_{i=1}^n \alpha_if(e_i) = \langle x, z \rangle$, where $z = \sum_{i=1}^n f(e_i)e_i$. Note the need for the $e_i$ to be orthonormal; otherwise, it wouldn't work. Furthermore, the $k_x$ are usually not orthonormal, so your function $f$ is not expressed as a linear combination of orthonormal vectors, and the naive argument falls apart. However, we can use the patched-up argument I just provided to convince ourselves of the existence of a representer $z$ for every functional $f$. This result is known as the Riesz Representation Theorem and is very important in functional analysis in general. The theorem also holds for infinite-dimensional Hilbert spaces, provided the functional is continuous (this is where the completeness requirement becomes necessary).
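Here is the finite-dimensional argument in code form, with $V = \mathbb{R}^3$, the standard dot product, a random orthonormal basis, and a linear functional I made up for the occasion; the representer $z = \sum_i f(e_i)e_i$ reproduces $f$ exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# V = R^3 with the standard dot product. Take *some* orthonormal basis:
# the columns of Q from a QR decomposition of a random matrix.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
basis = Q.T                      # rows e_1, e_2, e_3 are orthonormal

# A linear functional I made up: f(x) = 2*x_1 - x_2 + 0.5*x_3.
w = np.array([2.0, -1.0, 0.5])
def f(x):
    return w @ x

# Riesz representer: z = sum_i f(e_i) e_i.
z = sum(f(e) * e for e in basis)

# Check that f(x) = <x, z> for an arbitrary x (here z recovers w, as it must).
x = rng.normal(size=3)
print(f(x), x @ z)               # the two numbers agree
```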
Ok, so I've said a lot.
Time to summarize: we have a Hilbert space where elements are of the form $f(x) = \sum_{i=1}^\infty \alpha_i k_{x_i}(x)$. From here, we would have to verify that the evaluation functionals are continuous and that $f(x) = \langle f, k_x\rangle$, for every $f$. We can verify both of these things at the same time by noticing the following:
$$\langle f, k_x\rangle = \left\langle \sum_{i=1}^\infty\alpha_i k_{x_i}, k_x\right\rangle = \sum_{i=1}^\infty \alpha_i \langle k_{x_i}, k_x\rangle = \sum_{i=1}^\infty \alpha_i k(x_i, x) = f(x)$$
This shows that the evaluation functional at $x$ is of the form $\langle \cdot, z \rangle$ (that is, we just fixed one of the entries of the inner product). Since the inner product itself is always continuous, the evaluation functional is continuous as well. The continuity of the inner product is also what justifies pulling the series out of the inner product in the second equality.
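A quick numerical check of the reproducing property on the finite kernel expansions (same kind of Gaussian-kernel setup as in the earlier sketches, chosen only for illustration):

```python
import numpy as np

def k(x, y, gamma=1.0):
    return np.exp(-gamma * (x - y) ** 2)   # Gaussian kernel, for illustration

centers = np.array([0.0, 1.0, 2.5, -0.3])
alpha   = np.array([0.5, -1.0, 2.0, 0.7])

def f(x):
    """f = sum_i alpha_i k_{x_i}, evaluated at the point x."""
    return alpha @ k(centers, x)

def inner(a_coef, a_centers, b_coef, b_centers):
    """<sum_i a_i k_{x_i}, sum_j b_j k_{y_j}> = a^T K b with K_ij = k(x_i, y_j)."""
    K = k(a_centers[:, None], b_centers[None, :])
    return a_coef @ K @ b_coef

x = 0.8
# k_x itself lives in the space: a single term with coefficient 1 and center x.
lhs = inner(alpha, centers, np.array([1.0]), np.array([x]))
print(lhs, f(x))   # the same number: <f, k_x> = f(x)
```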
From here, you can finish proving the so-called Moore-Aronszajn theorem, which states that for every p.d. symmetric kernel $k$ there is a unique (uniqueness is the part I'll omit) RKHS for which the reproducing kernel is $k$.
Now let's tackle your questions.
It has a dimensionality, but not necessarily a finite one. For example, the RKHS associated with the RBF/Gaussian kernel is not finite-dimensional. Thus, there might not be an orthonormal basis in the usual sense (take a look at total orthonormal sets, or Hilbert-space bases, as I like to call them).
The patched-up proof of the Riesz Representation Theorem should have dissolved the confusion. $f$ is a vector in the span of the $k_x$, and its coordinates are scalars, as is always the case. $k_x$ is a vector, period. Together with the other $k_{x'}$, it generates all of $H_k$.
I'm not sure I understand your question. Anyhow, I'd like to point out that a lot of modern mathematics is done in an abstract fashion, and the situation is no different with RKHSs. They are just vector spaces with some additional structure. You can find very strange RKHSs out in the math-jungle, ones which have nothing to do with Machine Learning (I guess this is the angle you're coming from); so beware!
Reiterating: $m$ can be an integer or an arbitrary cardinal number such as $\aleph_1$. It's also important to note that in infinite-dimensional Hilbert spaces you have two notions of dimension: algebraic dimension (the one you are used to), which is almost useless, and orthogonal dimension, which has to do with the total orthonormal sets I mentioned earlier. The latter, on the contrary, is quite useful.
Yes.
I couldn't make sense of this.
All in all, I sense some confusion around the definitions. I hope my answer helps in that direction.
Regarding your questions at the end: they don't seem to be very important, and there are likely far more important questions that you can't yet answer. For instance, if you indeed came in contact with RKHSs through ML, can you justify to yourself why the algorithms keep working when you swap inner products for kernels? Moreover, do you understand the representer theorem and why the construction we carried out above helps to solve the potentially infinite-dimensional optimization problem in risk minimization?
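To give one concrete instance of that last point (a minimal sketch with a Gaussian kernel and made-up data, not part of the discussion above): in kernel ridge regression, the representer theorem guarantees that the minimizer over the whole, possibly infinite-dimensional, RKHS can be written as $f = \sum_{i=1}^n \alpha_i k_{x_i}$ over the training points alone, which turns the problem into solving a finite linear system.

```python
import numpy as np

def k(x, y, gamma=2.0):
    return np.exp(-gamma * (x - y) ** 2)     # Gaussian kernel, for illustration

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=25)              # made-up training inputs
y = np.sin(X) + 0.1 * rng.normal(size=25)    # made-up noisy targets
lam = 0.1                                    # regularization strength

# Representer theorem: the minimizer of
#   sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2   over all f in H_k
# has the form f = sum_i alpha_i k_{x_i}, and alpha solves (K + lam*I) alpha = y.
K = k(X[:, None], X[None, :])
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def f(x):
    """The fitted RKHS function: a finite kernel expansion over the training points."""
    return alpha @ k(X, x)

print(f(0.5), np.sin(0.5))   # the fit should be close to the underlying sine curve
```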