The problem has the general form of
$\mathbf{AX} = \mathbf{B}$
I am trying to estimate the matrix $\mathbf{A}$ ($m\times n$ where $m<n$) which describes the behavior of a physical system. I have some input stored as the columns of the matrix $\mathbf{X}$ ($n\times k$) and the output stored in the columns of the matrix $\mathbf{B}$ ($m\times k$).
I already have a good initial guess for $\mathbf{A}$, but it can definitely be improved.
I have been trying to minimize the objective function $||\mathbf{AX} - \mathbf{B}||^2 + \gamma ||\mathbf{A} - \mathbf{\hat{A}}||^2$, where $\gamma > 0$ is a regularization parameter and $\mathbf{\hat{A}}$ is my initial guess.
However, I am not that familiar with linear algebra, so I have had difficulty solving the problem, or even finding information on how to solve it. My first thought was to differentiate the objective function with respect to $\mathbf{A}$, set the result equal to $0$, and solve for $\mathbf{A}$, but even that causes problems because I do not know the differentiation rules for matrices.
Hope you can help me.
Update: Let me be more concrete
The entire setup is a hyperspectral thermal camera: a thermal camera with a scanning Fabry-Pérot interferometer (SFPI) placed in front of the lens. The distance between the mirrors of the SFPI determines the distribution of wavelengths entering the camera. $\mathbf{A}$ is a matrix describing the transmission of the SFPI, where each row represents a given mirror separation and each column represents the wavenumber of the incident light. This is why I have a good initial estimate for $\mathbf{A}$, given by the Airy function.
$$T = \frac{1}{1 + F\sin^2(2\pi d k)}$$ where $d$ is the mirror separation and $k$ is the wavenumber of the incident light. $F$ is the coefficient of finesse, determined by the mirror reflectance $r$: $$F = \frac{4r^2}{(1-r^2)^2}$$ Using a reflectance of $r=0.8$, this gives rise to the initial guess matrix $\mathbf{\hat{A}}$ shown in the figure below (both with physical units and with the pure row/column indices):
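As an aside, building $\mathbf{\hat{A}}$ from the Airy function is straightforward in, e.g., NumPy. In this sketch the $d$ and $k$ grids and their sizes are illustrative placeholders, not the actual instrument values:

```python
import numpy as np

# Sketch of the initial guess Ahat built from the Airy function above.
# The d and k grids below are illustrative placeholders.
r = 0.8                                 # mirror reflectance
F = 4 * r**2 / (1 - r**2)**2            # coefficient of finesse

d = np.linspace(1e-4, 5e-4, 64)         # mirror separations, one per row
k = np.linspace(800.0, 1200.0, 128)     # wavenumbers, one per column

# T(d, k) = 1 / (1 + F sin^2(2 pi d k)), evaluated on the (d, k) grid
Ahat = 1.0 / (1.0 + F * np.sin(2 * np.pi * np.outer(d, k))**2)
```

Every entry lies in $(0, 1]$, as a transmission should.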
Each column of $\mathbf{X}$ is a transmission FTIR measurement of a sample (multiplied by a 150 °C black-body distribution to match the experimental setup). The columns of $\mathbf{B}$ are then the corresponding interferograms recorded by the hyperspectral camera.

Solving directly
When I try to solve the problem directly using the equation supplied by @greg:
$$\mathbf{A} = (\mathbf{B}\mathbf{X}^T + \gamma_{init}\mathbf{\hat{A}})(\mathbf{X}\mathbf{X}^T + \gamma_{reg} \mathbf{M})^{-1}$$
I am able to get a promising result if I use a 2$^\textrm{nd}$-order regularisation to suppress oscillations in the solution by setting
$$\mathbf{M} = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 & \dots \\ -1 & 2 & -1 & 0 & 0 & \dots\\ 0 & -1 & 2 & -1 & 0 & \dots\\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots\\ \dots & 0 & 0 & -1 & 2 & -1\\ \dots & 0 & 0 & 0 & -1 & 1\\ \end{bmatrix}$$
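For reference, here is how the direct solve with the second-difference matrix $\mathbf{M}$ above might look in NumPy (a sketch; the function names and argument conventions are mine):

```python
import numpy as np

def second_difference(n):
    """Second-difference (discrete Laplacian) matrix M shown above."""
    M = (np.diag(np.full(n, 2.0))
         + np.diag(np.full(n - 1, -1.0), 1)
         + np.diag(np.full(n - 1, -1.0), -1))
    M[0, 0] = M[-1, -1] = 1.0   # free boundary rows [1, -1, ...] and [..., -1, 1]
    return M

def solve_direct(X, B, Ahat, g_init, g_reg):
    """A = (B X^T + g_init Ahat)(X X^T + g_reg M)^{-1}."""
    Z = X @ X.T + g_reg * second_difference(X.shape[0])
    rhs = B @ X.T + g_init * Ahat
    # Solve A Z = rhs without explicitly forming the inverse:
    # equivalent to solving Z^T A^T = rhs^T
    return np.linalg.solve(Z.T, rhs.T).T
```

Since $\mathbf{XX}^T$ and $\mathbf{M}$ are both symmetric, $\mathbf{Z}^T = \mathbf{Z}$; the transpose is kept only so the code stays correct for a general $\mathbf{M}$.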
Even if I use a random matrix as the initial guess, I am able to get a solution that resembles the Airy matrix (using $\gamma_{init} = 10^{-7}$ and $\gamma_{reg} = 10^{-3}$). Here I get a reconstruction error of $||\mathbf{AX}-\mathbf{B}||=84314$. The following figure shows the solution (displayed in gray scale) found using this method, with the Airy matrix overlaid on top.
Even though the initial guess is random, the method still finds a solution that resembles the Airy function, which to me indicates that we're onto something. Notice how the maximum of the fit is slightly shifted compared to the Airy function; this is the kind of 'improvement' that gives me more insight into the actual behaviour of the system. However, there are some flaws, most notably that some of the entries in the matrix are negative, which I would like to suppress.
Solving iteratively
I then tried to implement the iterative solution also proposed by @greg, using the Barzilai-Borwein stepper based on his pseudo-code. This does indeed promote non-negative solutions, but the solutions I am able to find do not really change the shape of the initial Airy function. Depending on the weights ($\gamma_{init}$ and $\gamma_{reg}$), I get solutions somewhere between the Airy function (with added noise) and a solution similar to what I get from solving the problem directly with $\mathbf{M} = \mathbf{I}$. This is illustrated in the figure below, where once again the grayscale image represents the solution for $\mathbf{A}$ and the yellow overlay is the maximum of the Airy function, which has here been used as the initial guess.
When setting $\gamma_{init} = 10^{1}$ and $\gamma_{reg} = 10^{3}$ I get the following evolution of $||\mathbf{G}||$ and the reconstruction error $||\mathbf{AX}-\mathbf{B}||$:
So while $||\mathbf{G}||$ is decreasing as expected, the reconstruction error only changes slightly, and for the worse. Does this indicate a wrong implementation of the solver on my part? I should add that for other configurations of $\gamma_{init}$ and $\gamma_{reg}$ the reconstruction error does decrease, but the change is still not substantial: for example, going from $||\mathbf{AX}-\mathbf{B}|| = 2.52932\times 10^5$ to $2.52927\times 10^5$ after 250 iterations, where it appears to flatten out.
I guess my general question is: if my implementation of the iterative solver is correct, is there a way to guide it more towards the solution found when solving directly with the 2$^\textrm{nd}$-order regularisation?




$ \def\L{\left} \def\R{\right} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\L(#1\R)}} \def\g{\gamma} \def\l{\lambda} \def\A{\widehat A} \def\e{\exp_\bullet} \def\f{\log_\bullet} $Calculate the gradient of the objective function as follows $$\eqalign{ \phi &= \tfrac12\L(AX-B\R):\L(AX-B\R) + \tfrac{\g}2\L(A-\A\R):\L(A-\A\R) \\ d\phi &=\L(AX-B\R):\L(dA\:X\R) + \g\L(A-\A\R):dA \\ &=\L(AXX^T-BX^T + \g A-\g\A\R):dA \\ \\ \frac{\partial\phi}{\partial A} &= A\L(XX^T+\g I\R) - \L(BX^T+\g\A\R) \;\doteq\; G \\ }$$ You can set $G=0$ and solve for $A$ directly (via matrix inversion) $$A = \L(BX^T+\g\A\R) \L(XX^T+\g I\R)^{-1}$$ or use $G$ in an iterative (gradient descent) algorithm $$\eqalign{ &Z = XX^T+\g I \\ &Y = BX^T+\g\A \\ &A_{\tt1} = \A \\ &{\rm for\;} k = {\tt1}\;{\rm to}\;k_{max} \\ &\qquad G_k = A_k Z - Y \\ &\qquad A_{k+1} = A_{k} - \l_k G_k \\ &\qquad k = k + {\tt1} \\ }$$ and stop when $\|G_k\|$ is small enough for your intended application. The stepsize $\l_k$ is specified by the descent algorithm (Barzilai-Borwein is my personal favorite).
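A minimal NumPy sketch of the descent loop above, for readers who want to try it. The problem sizes are whatever you pass in, and a conservative fixed step $\lambda = 1/\lambda_{\max}(Z)$ is used in place of Barzilai-Borwein to keep the example short:

```python
import numpy as np

def gradient_descent(X, B, Ahat, g, steps=500, lam=None):
    """Minimize phi = 0.5||AX-B||^2 + 0.5*g*||A-Ahat||^2 by gradient descent.

    Uses the gradient G = A(XX^T + gI) - (BX^T + g*Ahat) derived above,
    with a conservative fixed step instead of Barzilai-Borwein."""
    Z = X @ X.T + g * np.eye(X.shape[0])
    Y = B @ X.T + g * Ahat
    if lam is None:
        # 1/lambda_max(Z) guarantees descent for this quadratic objective
        lam = 1.0 / np.linalg.eigvalsh(Z)[-1]
    A = Ahat.copy()
    for _ in range(steps):
        G = A @ Z - Y
        A = A - lam * G
    return A
```

For this quadratic objective the iterates converge to the same $A$ as the direct solve $A = YZ^{-1}$.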
In the above, a colon denotes the Frobenius product, i.e. $$\eqalign{ A:G &= \operatorname{Tr}\L(A^TG\R) \;&=\; G:A \\ \L(AZ\R):G &= A:\L(GZ^T\R) \;&=\; Z:\L(A^TG\R) \\ }$$
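These identities are easy to check numerically, e.g. with random matrices in NumPy:

```python
import numpy as np

# Numerical check of the Frobenius-product identities above
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
G = rng.standard_normal((3, 4))
Z = rng.standard_normal((4, 4))

def frob(P, Q):
    """Frobenius product P:Q = Tr(P^T Q), i.e. the elementwise sum of P*Q."""
    return np.trace(P.T @ Q)
```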
Update
If you need a non-negative solution $\L(A_{ij}\ge0\R)$ introduce a new independent unconstrained matrix $U$ and use it to construct the $A$ matrix via an element-wise exponential function, i.e. $$A = \e(U) \quad\implies\quad dA = A\odot dU$$ where $\odot$ denotes the elementwise/Hadamard product.
The gradient wrt $U$ is a straightforward calculation $$\eqalign{ d\phi &=\L(AXX^T-BX^T + \g A-\g\A\R):\c{dA} \\ &=\L(AXX^T-BX^T + \g A-\g\A\R):\CLR{A\odot dU} \\ &=A\odot\L(AXX^T-BX^T + \g A-\g\A\R):dU \\ \\ \frac{\partial\phi}{\partial U} &= A\odot\L(A\L(XX^T+\g I\R) - \L(BX^T+\g\A\R)\R) \\ &= A\odot G\;\doteq\; F \\ }$$ While there is no direct solution for this problem, you can still use the gradient descent algorithm to calculate the optimal $U$ matrix $$\eqalign{ \def\U{\widehat U} &Z = XX^T+\g I \\ &Y = BX^T+\g\A \\ &U_{\tt1} = \f\L(\A\R) \\ &{\rm for\;} k = {\tt1}\;{\rm to}\;k_{max} \\ &\qquad A_k = \e(U_k) \\ &\qquad F_k = A_k\odot\L(A_k Z - Y\R) \\ &\qquad U_{k+1} = U_{k} - \l_k F_k \\ &\qquad k = k + {\tt1} \\ }$$ After which you can recover the corresponding non-negative solution $\:A=\e(U)$
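A NumPy sketch of this exponential-parametrisation descent. A fixed step size is used for simplicity, and $\mathbf{\hat{A}}$ must be strictly positive so that the initialization $U = \log\mathbf{\hat{A}}$ is defined:

```python
import numpy as np

def nonneg_descent(X, B, Ahat, g, steps=500, lam=1e-3):
    """Gradient descent in U with A = exp(U) elementwise, as derived above.

    Fixed step size for simplicity; Ahat must be strictly positive."""
    Z = X @ X.T + g * np.eye(X.shape[0])
    Y = B @ X.T + g * Ahat
    U = np.log(Ahat)
    for _ in range(steps):
        A = np.exp(U)
        F = A * (A @ Z - Y)     # gradient of phi with respect to U
        U = U - lam * F
    return np.exp(U)
```

The returned matrix is positive by construction, whatever the data.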
Update 2
Let me elaborate on adapting the Barzilai-Borwein method to matrix variables with a bit of pseudo-code.
Include $\,G=0\,$ in the initializations, then $$\eqalign{ \def\DG{\Delta G} \def\DU{\Delta U} {\rm for}&\;k=1\to 9999 \\ &\quad A = \e(U) \\ &\quad \DG = A\odot\L(AZ-Y\R) - G \\ &\quad G = G + \DG \\ \\ &\quad {\rm if}\;\|G\| \le10^{-13}{\;\rm then\; STOP} \\ \\ &\quad \l = 10^{-6} \\ \\ &\quad {\rm if}\;k>1{\;\rm then} \\ &\quad\quad \l = \frac{\DU:\DU}{\DU:\DG} \quad{\sf OR}\quad \l = \frac{\DU:\DG}{\DG:\DG} \\ &\quad {\rm endif} \\ \\ &\quad {\rm if}\;\l<0{\;\rm then} \\ &\quad\quad \l = \L[\frac{\DU:\DU}{\DG:\DG}\R]^{\frac12} \\ &\quad {\rm endif} \\ \\ &\quad \DU = -\l G \\ &\quad U = U + \DU \\ {\rm end} \\ }$$ Note that three different BB step lengths are shown above.
The first is called the long step, the second is the short step, and the third is the positive step which is the geometric mean of the first two.
The long step often converges a bit more quickly, but I usually use the short step. The problem is that the inner product $\L(\DU:\DG\R)$ can be negative or zero, so I'd rather have it in the numerator than the denominator. If that term is negative, switch to the positive step so that you keep going downhill.
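Putting the pseudo-code together, a NumPy sketch of the BB iteration for the non-negative problem might look like this (short step with the positive-step fallback; the function name and defaults are mine):

```python
import numpy as np

def bb_nonneg(X, B, Ahat, g, kmax=9999, tol=1e-13):
    """Barzilai-Borwein descent in U with A = exp(U) elementwise.

    Follows the pseudo-code above: short step lam = (dU:dG)/(dG:dG),
    switching to the geometric-mean 'positive' step when it is negative."""
    Z = X @ X.T + g * np.eye(X.shape[0])
    Y = B @ X.T + g * Ahat
    U = np.log(Ahat)                   # Ahat must be strictly positive
    G = np.zeros_like(U)
    dU = np.zeros_like(U)
    for k in range(1, kmax + 1):
        A = np.exp(U)
        dG = A * (A @ Z - Y) - G       # change in the gradient wrt U
        G = G + dG
        if np.linalg.norm(G) <= tol:
            break
        lam = 1e-6                     # tiny first step
        if k > 1:
            lam = np.sum(dU * dG) / np.sum(dG * dG)   # short BB step
            if lam < 0:                               # positive-step fallback
                lam = np.sqrt(np.sum(dU * dU) / np.sum(dG * dG))
        dU = -lam * G
        U = U + dU
    return np.exp(U)
```

Here `np.sum(P * Q)` computes the Frobenius product $P:Q$ directly.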