
As the image above shows, $M_1$ is a distribution with parameter $\theta$, while $M_2$ corresponds to $\theta' = \theta - \delta g$, where $\delta$ is the learning rate and $g$ is the gradient; that is, $M_2$ is obtained from $M_1$ by one gradient-descent step. $M^*$ is the given true model, and $M_1$, $M_2$ are used to fit $M^*$. All of the distances between these models ($a$, $b$, $c$) are measured by KL divergence. What I want to know is whether there exists an $\hat{M}$ with parameter $\hat{\theta} = \theta - \alpha\delta g$ ($\alpha$ is a free variable) that lies on the line between $M_1$ and $M_2$, forms a right angle with the direction to $M^*$, and thereby achieves a smaller distance than $\mathrm{KL}(M_2 \,\|\, M^*)$.
Let $x$ be the distance from $M_2$ to $\hat{M}$.
\begin{align*} c^2-(a-x)^2&=b^2-x^2\\ c^2&=(a^2-2ax+x^2)+b^2-x^2\\ 2ax&=a^2+b^2-c^2\\ x&=\dfrac{a^2+b^2-c^2}{2a} \end{align*}
Let $d$ be the distance from $M^*$ to $\hat{M}$: $$d=\sqrt{b^2-x^2}=\sqrt{b^2-\bigg(\dfrac{a^2+b^2-c^2}{2a}\bigg)^2}$$
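The derivation above can be sketched numerically. Note this treats $a$, $b$, $c$ as Euclidean side lengths, which is only a heuristic here, since KL divergence is not a metric (it is asymmetric and does not satisfy the triangle inequality). The function name and the reading of $\alpha$ as the step multiplier $(a-x)/a$ are my assumptions, not part of the original question:

```python
import math

def foot_of_perpendicular(a, b, c):
    """Given side lengths a = |M1 M2|, b = |M2 M*|, c = |M1 M*|
    (heuristically treated as Euclidean distances), return (x, d, alpha):
      x     -- distance from M2 to the foot M_hat of the perpendicular
               dropped from M* onto the line through M1 and M2,
      d     -- distance from M* to M_hat,
      alpha -- implied step multiplier in theta_hat = theta - alpha*delta*g,
               assuming a full step (alpha = 1) lands on M2, so
               alpha = |M1 M_hat| / |M1 M2| = (a - x) / a.
    """
    x = (a**2 + b**2 - c**2) / (2 * a)
    d = math.sqrt(b**2 - x**2)   # requires b >= |x|, i.e. the foot is real
    alpha = (a - x) / a
    return x, d, alpha
```

For example, with a 3-4-5 triangle ($a=5$, $b=3$, $c=4$) the foot lies at $x = 1.8$, giving $d = 2.4 < b$, i.e. the orthogonal point is strictly closer to $M^*$ than $M_2$ is whenever $0 < x < a$.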