I am wondering whether there is a convergence rate for the following approximation, and what assumptions I should impose.
Let $X\in L^2(\Omega;\mathbb{R}^d)$ be a random variable and $g: \mathbb{R}^d\to \mathbb{R}$ a given Lipschitz function. Suppose $\{h_k(X)\}_{k=1}^\infty$ is an orthogonal basis of $L^2(\Omega;\mathbb{R})$. Let us first take the $h_k$ to be indicator functions of disjoint hypercubes in $\mathbb{R}^d$. I would then like to approximate $g(X)$ using finitely many of these basis functions by setting $g(X)\approx \sum_{k=1}^M a_kh_k(X)$, where $a=\{a_k\}$ solves $$ a=\arg \min_{b\in\mathbb{R}^M}E\Big[\Big|g(X)-\sum_{k=1}^Mb_kh_k(X)\Big|^2\Big]. \label{a}\tag{1} $$
Since $a$ may not be computable analytically, I use $N$ independent samples of $X$ to approximate \eqref{a}, which yields the estimator $\hat{a}$: $$ \hat{a}=\arg \min_{b\in\mathbb{R}^M}\frac{1}{N}\sum_{i=1}^N\Big|g(X_i)-\sum_{k=1}^Mb_kh_k(X_i)\Big|^2. \label{b}\tag{2} $$
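For concreteness, here is a minimal numerical sketch of the estimator in \eqref{b} (my own illustration, not part of the question's setup: it assumes $d=1$, $X\sim\mathrm{Unif}[0,1]$, the Lipschitz function $g(x)=|x-1/2|$, and indicators of a uniform grid):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return np.abs(x - 0.5)  # Lipschitz with constant 1 (illustrative choice)

N, M = 10_000, 20                      # samples and basis functions
X = rng.uniform(0.0, 1.0, size=N)      # i.i.d. samples of X

# Design matrix H[i, k] = h_k(X_i), where h_k is the indicator of the k-th cell.
edges = np.linspace(0.0, 1.0, M + 1)
cells = np.clip(np.digitize(X, edges) - 1, 0, M - 1)
H = np.zeros((N, M))
H[np.arange(N), cells] = 1.0

# Empirical least-squares estimator (2); since the indicators are disjoint,
# a_hat_k is just the sample mean of g(X_i) over the k-th cell.
a_hat, *_ = np.linalg.lstsq(H, g(X), rcond=None)
```

Because the columns of $H$ are orthogonal for disjoint indicators, the least-squares solution decouples into per-cell averages, which is why this estimator is so cheap to compute in the indicator setting.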
Question: Is there a convergence rate for $\hat{a}\to a$, and what assumptions should I use?
Or can I show that $$E\Big[\Big|\sum_{k=1}^M\hat{a}_kh_k(X)-\sum_{k=1}^Ma_kh_k(X)\Big|^2\Big]$$ converges at a certain rate?
Check out
https://www.researchgate.net/publication/257457982_Quantitative_error_estimates_for_a_least-squares_Monte_Carlo_algorithm_for_American_option_pricing
and let me know if you have questions about it.
EDIT: In particular, check out Equation (25) in Theorem 3.3.
Basically, the error estimate depends on two terms:
The first corresponds to the usual Monte Carlo error and roughly behaves like $\sqrt{M/N}$, where $N$ is the number of samples and $M$ is the number of basis functions (in the paper's notation, $M$ is $\nu-1$, where $\nu$ is the VC dimension; in your setting $\nu$ is simply the number of basis functions plus one).
The second corresponds to how well your basis functions can approximate $g$. For example, the span of indicator functions of a grid-type decomposition of your domain with mesh size $h$ can approximate a Lipschitz function up to an error of size $\mathcal{O}(h)$. (Note that the number of indicator functions needed for this is $M=\mathcal{O}(h^{-d})$.)
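The $\mathcal{O}(h)$ claim is easy to check numerically. A sketch (my own illustration, assuming $d=1$, the uniform law on $[0,1]$, and the Lipschitz-1 function $g(x)=|x-1/2|$): the $L^2$-projection onto disjoint indicators is the per-cell average, and the resulting sup-norm (hence $L^2$) error is bounded by the Lipschitz constant times $h$.

```python
import numpy as np

# Illustration (not from the paper): d = 1, uniform law on [0, 1],
# Lipschitz-1 function g, uniform grid of mesh size h.
def g(x):
    return np.abs(x - 0.5)

h = 0.01
M = int(round(1.0 / h))                 # M = O(h^{-d}) indicator functions
edges = np.linspace(0.0, 1.0, M + 1)

# The L^2-projection onto disjoint indicators is the per-cell average,
# approximated here on a fine deterministic grid.
xs = np.linspace(0.0, 1.0, 1_000_001)
cells = np.clip(np.digitize(xs, edges) - 1, 0, M - 1)
counts = np.bincount(cells, minlength=M)
means = np.bincount(cells, weights=g(xs), minlength=M) / counts
sup_err = np.max(np.abs(g(xs) - means[cells]))  # Lipschitz => at most 1 * h
```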
SUMMARY: If you choose the width $h$ of the grid that defines your indicator functions to be $h:=N^{-\alpha}$, then approximately
$$ E\Big[\Big|g(X)-\sum_{k=1}^{M}\hat{a}_k h_k(X)\Big|^2\Big]\lesssim \sqrt{M/N}+h=\sqrt{N^{\alpha d-1}}+N^{-\alpha},\qquad M=h^{-d}=N^{\alpha d}. $$ To balance the two terms, set $(1-\alpha d)/2=\alpha$, i.e. $\alpha:=1/(d+2)$, which gives you an overall error of $\mathcal{O}(N^{-1/(d+2)})$.
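A small end-to-end experiment can illustrate this decay (again my own sketch, with the assumed setup $d=1$, $X\sim\mathrm{Unif}[0,1]$, $g(x)=|x-1/2|$, and $\alpha=1/(d+2)=1/3$); it only checks that the estimated squared $L^2$ error decreases as $N$ grows, not the exact exponent:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return np.abs(x - 0.5)              # Lipschitz-1 test function

def squared_l2_error(N, alpha=1.0 / 3.0):   # alpha = 1/(d+2) with d = 1
    h = N ** (-alpha)
    M = int(np.ceil(1.0 / h))               # M = h^{-d} = N^{alpha d} cells
    edges = np.linspace(0.0, 1.0, M + 1)

    # Fit a_hat on N training samples: per-cell means (the least-squares
    # solution for disjoint indicators); empty cells get coefficient 0.
    X = rng.uniform(0.0, 1.0, size=N)
    cells = np.clip(np.digitize(X, edges) - 1, 0, M - 1)
    counts = np.bincount(cells, minlength=M)
    sums = np.bincount(cells, weights=g(X), minlength=M)
    a_hat = np.divide(sums, counts, out=np.zeros(M), where=counts > 0)

    # Estimate E|g(X) - sum_k a_hat_k h_k(X)|^2 on fresh samples.
    Xt = rng.uniform(0.0, 1.0, size=200_000)
    ct = np.clip(np.digitize(Xt, edges) - 1, 0, M - 1)
    return float(np.mean((g(Xt) - a_hat[ct]) ** 2))

errors = [squared_l2_error(N) for N in (10**2, 10**3, 10**4, 10**5)]
```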