The Original Question (informal):
I have a black-box linear system $f:\mathbb{R}^N\rightarrow\mathbb{R}^M~(0\ll M\ll N)$. It is guaranteed that there exists a full-rank matrix $\mathbf{A}\in\mathbb{R}^{M\times N}$ satisfying $\forall \mathbf{x}\in\mathbb{R}^N,f(\mathbf{x})\equiv\mathbf{Ax}$. Here I call $\mathbf{A}$ the transformation matrix of $f$.
Now given a vector $\mathbf{y}\in\mathbb{R}^M$, I want to calculate $\mathbf{z}=\mathbf{A}^+\mathbf{y}\in\mathbb{R}^N$, where $\mathbf{A}^+\in\mathbb{R}^{N\times M}$ is the Moore–Penrose inverse of $\mathbf{A}$ (Reference 1); since $\mathbf{A}$ has full row rank, it satisfies $\mathbf{AA}^+=\mathbf{I}_M$.
Here is some supplementary information:
There is a way to recover the explicit form of $\mathbf{A}$ from a given fixed $f$ (Reference 2). However, since $M$ and $N$ are both very large (about $10^6$ or even $10^9$), I cannot afford to extract $\mathbf{A}$, explicitly compute $\mathbf{A}^+$, and then calculate $\mathbf{z}=\mathbf{A}^+\mathbf{y}$: the computational cost of this direct route is too high for me. I want an indirect way of obtaining $\mathbf{z}=\mathbf{A}^+\mathbf{y}$.
The inner structure of the linear $f$ is very complicated; in effect, $f$ is a black box. But in my system, I can conveniently evaluate $f(\mathbf{r})$ for any given $\mathbf{r}\in\mathbb{R}^N$. In other words, the forward pass of $f$ is fast and efficient.
I do not necessarily need $\mathbf{A}$, $\mathbf{A}^+$, or any operators built from them. I only want $\mathbf{z}=\mathbf{A}^+\mathbf{y}$, which is unique once $\mathbf{y}$ and $f$ are given and fixed.
There is no prior knowledge about $f$, $\mathbf{A}$, or $\mathbf{y}$. In other words, their entries may as well be random numbers, like noise.
So, how can I get $\mathbf{z}=\mathbf{A}^+\mathbf{y}$ efficiently?
Some of My Efforts:
I tried to search for an $\mathbf{x}$ satisfying $\mathbf{Ax=y}$. Concretely, I use PyTorch (a Python deep-learning framework) to optimize a randomly initialized $\mathbf{x}$ with the loss function $\mathcal{L}=\lVert f(\mathbf{x})-\mathbf{y} \rVert _2^2$, where $f$ is a neural network in my own implementation. When $\mathcal{L}$ hits $0$, I stop the optimization and take the result as my estimate.
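A minimal sketch of this attempt (assuming $f$ is a PyTorch module implementing the linear map, and that `N` and `y` are already defined; the optimizer settings are illustrative):

```python
import torch

x = torch.randn(N, requires_grad=True)   # randomly initialized x
opt = torch.optim.Adam([x], lr=1e-2)

for step in range(10_000):
    opt.zero_grad()
    loss = (f(x) - y).pow(2).sum()       # L = ||f(x) - y||_2^2
    loss.backward()
    opt.step()
    if loss.item() < 1e-12:              # stop when L is numerically zero
        break

x_hat = x.detach()                       # some solution of Ax = y, not necessarily A^+ y
```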
However, since $0\ll M\ll N$, there may exist $\mathbf{x}_1\ne \mathbf{x}_2$ with $\mathbf{Ax}_1=\mathbf{Ax}_2=\mathbf{y}$. Therefore, this method is not guaranteed to find the particular solution $\mathbf{z}=\mathbf{A}^+\mathbf{y}$ that I want (among all solutions of $\mathbf{Ax}=\mathbf{y}$, $\mathbf{z}$ is the one with minimum Euclidean norm).
Does there exist an efficient way to achieve this?
There are two possible statements (only one of which is true):
The answer is "Yes". There exists a way to efficiently calculate $\mathbf{z}=\mathbf{A}^+\mathbf{y}$ from given fixed $f$ and $\mathbf{y}$, without accessing the explicit forms of $\mathbf{A}$ or $\mathbf{A}^+$. In other words, there are properties of $\mathbf{z}=\mathbf{A}^+\mathbf{y}$ that can be exploited, but I have not yet found them.
The answer is "No". To get $\mathbf{z}=\mathbf{A}^+\mathbf{y}$, I must obtain the explicit form of $\mathbf{A}^+$ and then calculate $\mathbf{A}^+\mathbf{y}$. There is no efficient indirect way.
After a long struggle, I still have no idea about this problem, and I am now inclined to believe the second statement.
Any solutions, suggestions, or discussion of this problem would be much appreciated.
I am still searching, trying and thinking ...
The Wikipedia article on Moore-Penrose inverses includes the following formula, valid if $\mathbf A$ has linearly independent rows: $$\mathbf A^+ = \mathbf A^T (\mathbf A \mathbf A^T)^{-1}$$ where I've replaced their $\mathbf A^*$ with $\mathbf A^T$ because your matrix entries are real-valued.
You said we can assume $\mathbf A$ has full rank, so this formula is valid in your case. (There's also a separate formula that instead assumes $\mathbf A$ has l.i. columns, which would obviously be false since you have $M \ll N$.)
You can use this formula to compute $\mathbf z = \mathbf A^+ \mathbf y$ without ever building $\mathbf A^+$. In particular: (1) solve the $M \times M$ linear system $(\mathbf A \mathbf A^T)\,\mathbf w = \mathbf y$ for $\mathbf w$; (2) compute $\mathbf z = \mathbf A^T \mathbf w$.
Step (2) is easy if your black box allows you some way to efficiently compute $\mathbf A^T \mathbf v$ for an input vector $\mathbf v \in \mathbb{R}^M$.
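If $f$ is implemented in an autodiff framework such as PyTorch (as in your question), this is already available: for a linear map, the vector-Jacobian product $\nabla_{\mathbf x}\langle f(\mathbf x), \mathbf v\rangle$ equals $\mathbf A^T \mathbf v$ at any point $\mathbf x$. A sketch (the helper name `At_matvec` is mine, and shapes are illustrative):

```python
import torch

def At_matvec(f, v, N):
    """Compute A^T v via reverse-mode autodiff.

    Works because f is exactly linear, so its Jacobian is A everywhere
    and the vector-Jacobian product at any x (here x = 0) is A^T v."""
    x = torch.zeros(N, requires_grad=True)
    (At_v,) = torch.autograd.grad(f(x), x, grad_outputs=v)
    return At_v
```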
Step (1) looks hard, but it is actually a well-studied problem; see the Wikipedia article on "matrix-free methods", which is exactly about finding approximate solutions to linear systems when the matrix is only accessible as a black box. These methods mostly come down to multiplying by the system's matrix many times. In your case, to multiply by $\mathbf A \mathbf A^T$ you should not build the full matrix; instead, take a vector $\mathbf u_0$, compute $\mathbf u_1 = \mathbf A^T \mathbf u_0$, then compute $\mathbf u_2 = \mathbf A \mathbf u_1$. The step $\mathbf u_1 \mapsto \mathbf u_2$ is just an application of your black-box function $f$. The step $\mathbf u_0 \mapsto \mathbf u_1$ again relies on having a relatively efficient method for applying $\mathbf v \mapsto \mathbf A^T \mathbf v$.
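Since $\mathbf A$ has full row rank, $\mathbf A \mathbf A^T$ is symmetric positive definite, so conjugate gradients is a natural matrix-free solver for step (1). A minimal sketch (the helper name `cg_solve` is mine):

```python
import torch

def cg_solve(matvec, y, iters=500, tol=1e-10):
    """Conjugate gradients for the SPD system (A A^T) w = y.

    Touches the matrix only through matvec(v) = A A^T v; the matrix
    itself is never formed."""
    w = torch.zeros_like(y)
    r = y.clone()                 # residual y - (A A^T) w, with w = 0
    p = r.clone()
    rs = r.dot(r)
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / p.dot(Ap)
        w = w + alpha * p
        r = r - alpha * Ap
        rs_new = r.dot(r)
        if rs_new.sqrt() < tol:   # converged: residual is small enough
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w
```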
In summary, you should look at the details of how you're computing $f$ and see if you can set up a similar black-box way to efficiently compute $\mathbf A^T \mathbf v$ for input vectors $\mathbf v \in \mathbb{R}^M$. If you can, then the above strategy lets you multiply by $\mathbf A^+$ far more efficiently than fully building and storing $\mathbf A^+$.
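Putting the pieces together, still under the assumption that $f$ is PyTorch-differentiable and using the hypothetical helpers sketched above:

```python
# z = A^+ y = A^T (A A^T)^{-1} y, computed entirely matrix-free.
AAt = lambda v: f(At_matvec(f, v, N))   # v -> A A^T v: one VJP + one forward pass
w = cg_solve(AAt, y)                    # step (1): solve (A A^T) w = y
z = At_matvec(f, w, N)                  # step (2): z = A^T w
```

Each conjugate-gradient iteration then costs one forward pass of $f$ plus one backward pass, and nothing of size $M \times N$ is ever stored.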