Suppose $M\in \mathbb{R}^{N\times N}$ is a positive definite matrix whose entries $M_{ij}=f_{ij}(x)$ are functions of a $p$-dimensional input $x\in \mathbb{R}^{p}$. We want to use gradient descent to optimize a function $g(x)=h(M^{-1})$ (assume all the required gradients can be computed).
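(To make the source of the cost concrete: by the chain rule, the gradient of $g$ involves the derivative of the inverse, which by the standard identity is
$$\frac{\partial M^{-1}}{\partial x_k} \;=\; -\,M^{-1}\,\frac{\partial M}{\partial x_k}\,M^{-1}, \qquad k=1,\dots,p,$$
so each partial derivative requires multiplications by $M^{-1}$, i.e. a factorization or inverse of $M$ must be available at every iteration.)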
(For example, in a Gaussian process model we can use a gradient-based method to optimize the marginal log-likelihood $ \log p(\mathbf{y}\mid \mathbf{X},\Theta,\sigma)=-\frac{1}{2}\mathbf{y}^{T}(K+\sigma^2 I)^{-1}\mathbf{y}-\frac{1}{2}\log |K+\sigma^2 I| - \frac{N}{2}\log 2\pi$ with respect to the hyperparameters $\Theta$ and $\sigma$, where the kernel matrix $K$ is a function of $\Theta$.)
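To make the example concrete, here is a minimal sketch of evaluating this marginal log-likelihood in the usual way, via a Cholesky factorization rather than an explicit inverse. The kernel choice (an RBF kernel) and the hyperparameter names `ell`, `sigma_f`, `sigma` are my own assumptions, not from the question; the $O(N^3)$ factorization is the dominant cost either way.

```python
import numpy as np

def rbf_kernel(X, ell=1.0, sigma_f=1.0):
    """Squared-exponential kernel matrix K(X, X); K depends on (ell, sigma_f) = Theta."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

def gp_log_marginal_likelihood(X, y, ell=1.0, sigma_f=1.0, sigma=0.1):
    """log p(y | X, Theta, sigma) for a zero-mean GP with Gaussian noise."""
    N = len(y)
    K = rbf_kernel(X, ell, sigma_f) + sigma**2 * np.eye(N)
    L = np.linalg.cholesky(K)                             # O(N^3): the dominant cost
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y, two O(N^2) triangular solves
    logdet = 2.0 * np.sum(np.log(np.diag(L)))             # log|K + sigma^2 I| from the Cholesky diagonal
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * N * np.log(2.0 * np.pi)
```

Even with this structure, every fresh evaluation (and hence every gradient step that changes $\Theta$ or $\sigma$) pays for a new $O(N^3)$ factorization.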
Suppose $N=\Theta(p)$. I am curious about the cost of computing the gradient of this function at each iteration. Naively, each iteration requires computing the inverse (or a factorization) of $M$ once, so when $N=\Theta(p)$ the per-iteration cost of gradient descent is at least $O(p^3)$, which is much larger than the per-iteration cost gradient descent pays in general, namely $O(p)$.
Is this analysis correct? Are there any methods that can help reduce the cost in this kind of situation?
Thank you!