I am not sure if I am asking this correctly. I am working on using Fisher Information to examine the information in a model (say a neural network, for simplicity).
What I know is that the definition of Fisher Information is
$$I(\theta^*)=\operatorname{Var}\left[\frac{\partial}{\partial \theta}\log p(Y\mid \theta, X)\,\Big|_{\theta=\theta^*}\right]$$
Conceptually, I understand that the Fisher Information is the variance of the derivative of the log-likelihood (the score). But what exactly are $\theta$ and the distribution the variance is taken over? If the Fisher Information is the variance of the score evaluated at the TRUE $\theta^*$, how can we possibly know the true parameters? And in the classification case, how can we know the true distribution either?
What I have is
- A neural network model with some trainable parameters
- It is a classification task; the last layer is a softmax and the loss function is cross-entropy.
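For concreteness, here is a toy version of my setup in numpy (the single linear softmax layer, the synthetic data, and the learning rate are placeholders, not my actual network), together with the "empirical Fisher" computation I have seen in papers: the average outer product of per-sample score vectors, evaluated at the *fitted* parameters rather than the unknown $\theta^*$. I am unsure whether this substitution is what justifies the computation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions for illustration): n samples, d features, k classes,
# and a single linear layer W with softmax output instead of a real network.
n, d, k = 500, 2, 3
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, k))

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# Sample labels from the conditional p(y | x, W_true).
P = softmax(X @ W_true)
y = np.array([rng.choice(k, p=p) for p in P])
Y = np.eye(k)[y]  # one-hot labels

# Fit W by plain gradient descent on the average cross-entropy (negative log-likelihood).
W = np.zeros((d, k))
for _ in range(2000):
    G = X.T @ (softmax(X @ W) - Y) / n  # gradient of the average NLL w.r.t. W
    W -= 0.5 * G

# Empirical Fisher at the fitted W: average outer product of per-sample scores.
# Per-sample score of log p(y_i | x_i, W) w.r.t. W is outer(x_i, onehot_i - p_i).
scores = np.einsum('ni,nj->nij', X, Y - softmax(X @ W)).reshape(n, -1)  # (n, d*k)
F = scores.T @ scores / n  # (d*k, d*k) empirical Fisher matrix
```

By construction `F` is symmetric and positive semi-definite; what I cannot see is why evaluating it at the fitted `W` instead of `W_true` is legitimate.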
If we knew the true parameters, we wouldn't even need to train the model. So I don't understand how people compute the Fisher information matrix in research.
Thanks!