I've always thought of the Jacobain as "a linear approximation on how the output changes with the input". More constructively, for a matrix representing a Jacobian, J, the columns of J represent the inputs, and the rows of J represent the outputs.
So if I had a matrix function $f(P) = V(P)D(P)^{-1}$ where $V(P) = PX$, $D(P) = diag(V^T e_n)$, P is an $n \times m$ matrix, X is a constant $m \times k$ matrix, and $e_n$ is the unitvector with $1$ in the $n$th position, and I wanted to take the derivative with respect to the matrix P, $\frac{\partial f}{\partial P}$, I'd have a jacobian, J where. $$ \frac{\partial f}{\partial P} = J = \begin{bmatrix} \frac{\partial f_{11}}{\partial p_{11}} & \ldots & \frac{\partial f_{11}}{\partial p_{nm}}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_{nk}}{\partial p_{11}} & \ldots & \frac{\partial f_{nk}}{\partial p_{nm}}\\ \end{bmatrix} $$
I understand that this Jacobian is a vectorized version of a 4 dimensional tensor. But if I try to solve this numerically using the formula: $$ \text{inc}(P, ab, \epsilon) = \text{increment } p_{ab} \text{ by } \epsilon\\ \frac{\partial f_{ij}}{\partial p_{ab}} = {\frac{f(P) - f(\text{inc}(P,ab, \epsilon))}{\epsilon}}_{i,j} $$ This Jacobian is completely different then the version I get from the accepted answer here.
For example, one issue with my method is that rows $n, 2n, 3n, 4n$ are all $0$. This makes sense to me since everything is being divided by the last row, therefore the last row's output is always 1, and since it's constant, it doesn't depend on anything, so the derivative is always $0$. This is not what I get when doing the method in the link above.
My question is, what am I misunderstanding about the Jacobian? Why does taking the derivative directly create a vastly different result than taking the derivative numerically in this case.
So, As it turns out, my intuition was correct. After trying almost everything else (changing epsilon, splitting), I tested the derivative of my matrix against an automatic numerical differentiator, and the matrix using my method and the one from the program were the same. I think the intuition here is right. I assume the confusion came from a different mistake I made somewhere else.
If someone stumbles along this and is wondering what to try: