I have a vague idea of how matrix secant methods work: the essence of the method is still Newton's method, which does root finding on $f(x)$ and updates the iterates $x_k$ by
$$x_{k+1} = x_k - J(x_k)^{-1} f(x_k),$$
where $J(x_k)$ is the Jacobian of $f$ at $x_k$.
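To fix intuition, here is a minimal scalar sketch of that Newton iteration (my own illustration; the function name and tolerances are assumptions, not from any referenced text):

```python
def newton_root(f, fprime, x0, tol=1e-12, max_iter=50):
    """Scalar Newton root finding: x_{k+1} = x_k - f(x_k)/f'(x_k)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x -= fx / fprime(x)
    return x

# Find sqrt(2) as the positive root of x^2 - 2.
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
# root ≈ 1.41421356...
```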
The secant method then strives to do what Newton's method does, with two modifications:
1) we are trying to find the roots of the gradient, $g(x) = \nabla f(x) = 0 $
2) we take a secant approximation of the Jacobian instead of the Jacobian itself: $M_k (x_{k+1} - x_k) = g(x_{k+1}) - g(x_k)$.
The next idea of matrix secant methods is that consecutive approximations $M_k$ should be close to each other, i.e. we minimize $\|M_{k+1} - M_k\|$.
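As a concrete sketch of both the secant equation and this closeness requirement (my own illustration; the function name is an assumption), Broyden's rank-one update is known to be the minimal-Frobenius-norm change to $M_k$ that satisfies the secant equation:

```python
import numpy as np

def broyden_update(M, s, y):
    """Rank-one Broyden update: minimizes ||M_new - M||_F
    subject to the secant equation M_new @ s = y,
    where s = x_{k+1} - x_k and y = g(x_{k+1}) - g(x_k)."""
    return M + np.outer(y - M @ s, s) / (s @ s)

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
s = rng.standard_normal(3)
y = rng.standard_normal(3)

M_new = broyden_update(M, s, y)
assert np.allclose(M_new @ s, y)              # secant equation holds
assert np.linalg.matrix_rank(M_new - M) == 1  # the change has rank one
```

Note that the difference $M_{k+1} - M_k$ here is an outer product, i.e. exactly the "low rank" difference between successive approximations that Nocedal and Wright mention.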
For example, in Nocedal and Wright, we have the sentence: “…and a requirement that the difference between successive approximations have low rank.” Why is it that we want consecutive iterates of the secant methods to be “close” to each other in this sense?
You asked a lot of questions, and I will attempt to provide an "answer" - the quotes are there since, to the best of my knowledge, there is no formal proof that these methods work better than the gradient method, despite their excellent performance in practice.
So first, let's look at (the pure) Newton's method for minimization - we find a descent direction $d_k$ by solving the system $\nabla^2 f(x_k) d_k = -\nabla f(x_k)$, and then compute $$ x_{k+1} = x_k + d_k = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k). $$
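The pure Newton iteration above can be sketched as follows (a minimal illustration under my own naming; `grad` and `hess` are callables supplied by the user):

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Pure Newton's method: solve H d = -g, then step x <- x + d."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), -g)  # descent direction when H is PD
        x = x + d
    return x

# f(x, y) = (x - 1)^2 + 10*(y + 2)^2, with minimizer (1, -2)
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
hess = lambda x: np.diag([2.0, 20.0])
x_star = newton_minimize(grad, hess, [5.0, 5.0])
# converges to [1, -2] in a single step, since f is quadratic
```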
First, observe that $d_k$ is guaranteed to be a descent direction when $\nabla^2 f(x_k)$ is positive definite: in that case $\nabla f(x_k)^T d_k = -\nabla f(x_k)^T [\nabla^2 f(x_k)]^{-1} \nabla f(x_k) < 0$. Since $M_k$ attempts to imitate the Hessian, it should also be positive definite. This answers your second question (in the comments).
To answer the rest, I will need to look from another perspective. Observe that Newton's method is equivalent to $$ x_{k+1} = \operatorname*{argmin}_x \{ f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T \nabla^2 f(x_k)(x - x_k) \} $$ Indeed, when $\nabla^2 f(x_k)$ is positive definite the function inside the braces is convex, and if you set its gradient with respect to $x$ to zero, you obtain the Newton update. What is inside the braces is the 2nd order Taylor approximation. So Newton's method essentially assumes that the 2nd order Taylor approximation is a good approximation for the function, which we can minimize instead of minimizing the function itself.
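Setting the gradient of the bracketed quadratic model to zero makes this equivalence explicit:
$$ \nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) = 0 \quad \Longrightarrow \quad x = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k), $$
which is exactly the Newton update.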
Remember that the error term in the 2nd order approximation includes the 3rd derivatives. If they are small, the approximation is good. Thus, Newton's method works well when the 2nd derivative changes 'slowly' in some sense. And indeed, there are theorems which rely on this slowness to formally prove its convergence: either assume that $\nabla^2 f$ is Lipschitz continuous, or assume that $f$ is self-concordant.
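For completeness, a standard quantitative version of this: if $\nabla^2 f$ is $L$-Lipschitz, then the 2nd order Taylor model $q_k$ built at $x_k$ satisfies
$$ |f(x) - q_k(x)| \le \frac{L}{6} \|x - x_k\|^3, $$
so a slowly changing Hessian (small $L$) means the quadratic model is accurate on a large neighborhood of $x_k$.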
This analysis attempts to intuitively answer your first question of why we require consecutive $M_k$ to be close to each other: it is to imitate the behavior of such 'good' functions, whose Hessian changes slowly.