Suppose I have two tall ($N>K$) full column-rank matrices $A, B \in \mathbb{R}^{N \times K}$, and I want to compute their minimum distance up to a matrix multiplication, i.e.:
$$\min_{T} \|A-BT\|_F^2$$
where $T \in \mathbb{R}^{K \times K}$ has a specific structure and is invertible by construction; this structure renders the problem non-convex, since each column contains repeated terms raised to powers.
In principle I would like to have $A=BT$; since $A,B$ are full-rank, if the equality were satisfied then $T= B^\dagger A$, and $T$ would have the structure I am looking for. But assume I obtain noisy $A$ and $B$ for which the equality does not hold. Then I would like to find the $T$ such that $A-BT$ is as close to $0$ as possible. How can I find a $T$ with this particular structure?
A similar argument can be made for standard least squares, $y-Ax$, where $x = A^\dagger y$ minimizes the error. But this holds only if $x$ is unstructured. If you know that $x$ has a particular structure, how do you encode that constraint in the problem? Can you project the solution obtained with the pseudoinverse onto the space of solutions $x$ that have the desired structure?
Note: It is difficult to encode the structure of $T$ with a standard constraint (e.g. symmetric, diagonal, etc.).
$ \def\a{\alpha}\def\b{\beta}\def\l{\lambda} \def\m#1{\left[\begin{array}{c}#1\end{array}\right]} \def\p{\prime} \def\grad#1#2{\frac{\partial #1}{\partial #2}} \def\LR#1{\left(#1\right)} \def\BR#1{\Big(#1\Big)} \def\qiq{\quad\implies\quad} \def\cred#1{\color{red}{#1}} \def\cblue#1{\color{blue}{#1}} \def\CLR#1{\cred{\LR{#1}}} $I'm not sure if this will address your concerns about structure, but consider the following...
Given a vector of parameters $\{p\}$ and basis of $(0,1)$-matrices $\,\{M_k\},\,$ e.g. $$\eqalign{ p &= \m{\a \\ \b},\qquad M_1 = \m{1 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0},\qquad M_2 = \m{0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1} \\ }$$ create a structured matrix $\{X\}$ and cost function $\{\phi\}$ $$\eqalign{ X &= \sum_{i=1}^2\;p_iM_i \;=\; \m{\a & \a & 0 \\ 0 & \b & 0 \\ 0 & \b & \b},\qquad &\phi = \tfrac 12\Big\|BX-A\Big\|_F^2 \\ }$$ Note that the parameters $(\a,\b)$ are the only independent variables in the entire problem.
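Here is a minimal NumPy sketch of this setup (the basis matrices are the ones above; the data matrices $A,B$ are assumed given elsewhere):

```python
import numpy as np

# (0,1) basis matrices encoding the structure of X
M1 = np.array([[1., 1., 0.],
               [0., 0., 0.],
               [0., 0., 0.]])
M2 = np.array([[0., 0., 0.],
               [0., 1., 0.],
               [0., 1., 1.]])
M = [M1, M2]

def build_X(p):
    """Assemble the structured matrix X = sum_i p_i * M_i."""
    return sum(pi * Mi for pi, Mi in zip(p, M))

def cost(p, A, B):
    """phi = 0.5 * ||B X - A||_F^2."""
    return 0.5 * np.linalg.norm(B @ build_X(p) - A, 'fro')**2
```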
When $X$ is unconstrained it's easy to calculate the gradient/differential of the cost $$\eqalign{ G = \grad{\phi}{X} = B^T(BX-A) \qiq d\phi = G:dX \\ }$$ However, since we've imposed structure on $X$, its differential is also constrained $$dX = \sum_{i=1}^2 M_i\,dp_i$$ Substituting this expression leads to the $\,\cred{parametric\:gradient}$ $$\eqalign{ d\phi &= \sum_{i=1}^2\;G:(M_i\,dp_i) = \sum_{i=1}^2\LR{\grad{\phi}{p_i}} dp_i \quad\implies\quad \cred{\grad{\phi}{p_i} = G:M_i} \\ }$$ where a colon denotes the matrix inner product, i.e. $\;A:B = {\rm Tr}(A^TB)$
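In code, the parametric gradient is just the Frobenius inner product of $G$ with each basis matrix (continuing the sketch above):

```python
def parametric_gradient(p, A, B):
    """dphi/dp_i = G : M_i,  with  G = B^T (B X - A)."""
    G = B.T @ (B @ build_X(p) - A)
    # G:M_i = Tr(G^T M_i) = elementwise sum of G * M_i
    return np.array([np.tensordot(G, Mi) for Mi in M])
```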
At this point, simply optimize with respect to the $p$-vector. $$\eqalign{ &0 = \grad{\phi}{p_i} = \LR{B^TBX-B^TA}:M_i \\ &\cred{B^TA:M_i} = (B^TBM_i):X = \sum_{j=1}^2\cblue{(B^TBM_i):M_j}\,p_j \\ &\cred{w_i} = \sum_{j=1}^2 \cblue{H_{ij}}\,p_j \\ }$$
This is a linear equation which can be easily solved for the optimal $p$-vector $$\eqalign{ &w = Hp \qiq p=H^{-1}w \\ }$$ from which the optimal $X$-matrix can be computed $$\eqalign{ X &= \sum_{i=1}^2\;M_i\,p_i \\ }$$ In this particular example, $\cblue{H}$ happens to be the Hessian of the cost function, which should be a positive definite matrix.
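Continuing the sketch, the linear case admits this closed-form solution; the random $B$ and the true parameters below are placeholders used only for a sanity check:

```python
def solve_structured(A, B):
    """Build H_ij = (B^T B M_i):M_j and w_i = (B^T A):M_i, then solve H p = w."""
    BtB, BtA = B.T @ B, B.T @ A
    H = np.array([[np.tensordot(BtB @ Mi, Mj) for Mj in M] for Mi in M])
    w = np.array([np.tensordot(BtA, Mi) for Mi in M])
    p = np.linalg.solve(H, w)
    return p, build_X(p)

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))     # tall, full column rank (a.s.)
p_true = np.array([1.5, -0.7])
A = B @ build_X(p_true)             # noiseless data for the check
p_hat, X_hat = solve_structured(A, B)
print(np.allclose(p_hat, p_true))   # True
```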
Update
Assume that $X$ contains nonlinear elements, e.g. $\,X_{33}=2\a\b^2.$
To accommodate this create a new dummy parameter $$\l=2\a\b^2\qiq d\l = \CLR{2\b^2d\a + 4\a\b\:d\b}$$ and modify the parameter vector and matrix basis $$\eqalign{ p^\p = \m{\a\\ \b\\ \l} ,\quad M_3^\p = \m{0&0&0\\0&0&0\\0&0&1} ,\quad M_2^\p = \LR{M_2 - M_3^\p} ,\quad M_1^\p = M_1 \\ }$$ Then everything proceeds as before $$\eqalign{ X &= \sum_{i=1}^3\;M_i^\p\,p_i^\p \qiq dX = \sum_{i=1}^3\;M_i^\p\,dp_i^\p \\ d\phi &= \sum_{i=1}^3\;\LR{G:M_i^\p}\,dp_i^\p \\ &= \LR{G:M_1^\p}\,dp_1^\p + \LR{G:M_2^\p}\,dp_2^\p + \LR{G:M_3^\p}\,dp_3^\p \\ }$$ But the cost differential can be rewritten in terms of the original parameters $$\eqalign{ d\phi &= \LR{G:M_1^\p}\,dp_1 + \LR{G:M_2^\p}\,dp_2 + \LR{G:M_3^\p} \CLR{2 p_2^2 \:dp_1 + 4 p_1 p_2\:dp_2} \\ &= \BR{G:\LR{M_1^\p+2p_2^2 M_3^\p}}\,dp_1 + \BR{G:\LR{M_2^\p+4 p_1 p_2 M_3^\p}}\,dp_2 \\ \grad{\phi}{p_1} &= G:\LR{M_1^\p+2p_2^2 M_3^\p} , \quad\grad{\phi}{p_2} = G:\LR{M_2^\p+4 p_1 p_2 M_3^\p} \\ }$$ This time setting the parametric gradients to zero yields a system of nonlinear equations, so one must resort to numerical methods for a solution, e.g. Newton-Raphson or Gradient Descent.
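Here is a hedged sketch of this nonlinear case, reusing the names from the snippets above; SciPy's BFGS is just one convenient quasi-Newton stand-in for the Newton-Raphson or gradient-descent methods mentioned:

```python
from scipy.optimize import minimize

M3p = np.zeros((3, 3)); M3p[2, 2] = 1.0   # M_3'
M1p, M2p = M1, M2 - M3p                   # M_1', M_2'

def build_X_nl(p):
    a, b = p
    return a * M1p + b * M2p + 2 * a * b**2 * M3p   # X_33 = 2*alpha*beta^2

def cost_nl(p, A, B):
    return 0.5 * np.linalg.norm(B @ build_X_nl(p) - A, 'fro')**2

def grad_nl(p, A, B):
    a, b = p
    G = B.T @ (B @ build_X_nl(p) - A)
    return np.array([np.tensordot(G, M1p + 2 * b**2 * M3p),    # dphi/d alpha
                     np.tensordot(G, M2p + 4 * a * b * M3p)])  # dphi/d beta

A_nl = B @ build_X_nl(p_true)   # noiseless target for a sanity check
res = minimize(cost_nl, x0=np.array([0.5, 0.5]), args=(A_nl, B),
               jac=grad_nl, method='BFGS')
print(res.x)  # a local minimum; should approach p_true for a good initial guess
```

Since the problem is non-convex, the recovered parameters depend on the initial guess, so it may be worth restarting from several starting points.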
The key idea is to introduce dummy parameters such that $\{X,dX\}$ are linear combinations of $\{p_k^\p,dp_k^\p\}$, then substitute all dummy $\{dp_k^\p\}$ with the original $\{dp_k\}$ in the cost differential.
It is also convenient to choose an orthogonal basis, i.e. $\:M_j^\p\odot M_k^\p = \delta_{jk}M_k^\p$, so that the $\LR{M_j^\p:M_k^\p}$ cross-terms cancel when you substitute $X=\sum\;M_j^\p\,p_j^\p$.
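For the example basis above this orthogonality holds automatically, since $M_1$ and $M_2$ have disjoint supports; a one-line check:

```python
print(np.all(M1 * M2 == 0))  # True: the elementwise product vanishes
```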