Fisher Information Matrix in machine learning


In recent weeks I have been reading machine learning papers that use Fisher information theory. Given a parameter set $\Theta \subseteq \Bbb R^d$, I have always defined the Fisher information of a statistical model $\mathcal{M}_{\Theta} = \{p_{\theta}(y|x): \theta\in\Theta\}$ as $$ F(\theta) = \mathbb{E}_{p_{\theta}(x,y)}\big[(\nabla \log p_{\theta}(x, y))^{\otimes 2}\big]. $$

Surprisingly, I have found works using another definition. Given an unknown distribution $p(x,y)$ and a parametric model $\mathcal{M}_{\Theta}$, the Fisher information is defined as $$ F(\theta) = \mathbb{E}_{p(x,y)}\big[(\nabla \log p_{\theta}(x, y))^{\otimes 2}\big]. $$ The two definitions differ in the distribution the expectation is taken over: the model's own distribution $p_{\theta}(x,y)$ in the first, and the true data distribution $p(x,y)$ in the second. While the first definition is standard and there is a huge mathematical literature on it, the second one is a bit mysterious to me.
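To make the difference concrete, here is a minimal numerical sketch (my own illustration, not from any of the papers): a 1-D logistic model $p_\theta(y=1|x) = \sigma(\theta x)$, where the first (model) Fisher averages the squared score over labels drawn from $p_\theta(y|x)$ itself, while the second (sometimes called the "empirical Fisher") averages it over observed labels drawn from a different, misspecified distribution. The specific data-generating parameters are arbitrary assumptions.

```python
# Comparing the two Fisher definitions for a toy 1-D logistic model
# p_theta(y=1|x) = sigmoid(theta * x). Purely illustrative assumptions:
# inputs ~ N(0,1), model parameter theta = 0.5, true labels generated
# with a different parameter (1.5) to mimic model misspecification.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5
x = rng.normal(size=100_000)                  # samples from p(x)
p1 = 1.0 / (1.0 + np.exp(-theta * x))         # p_theta(y=1 | x)

# Score of log p_theta(y|x) w.r.t. theta is (y - p1) * x for y in {0, 1}.
# Definition 1: expectation of score^2 with y ~ p_theta(.|x), averaged over x.
fisher_model = np.mean(
    p1 * ((1.0 - p1) * x) ** 2 + (1.0 - p1) * ((0.0 - p1) * x) ** 2
)

# Definition 2: expectation under the (here misspecified) data distribution:
# labels are drawn with parameter 1.5, not the model's theta = 0.5.
y = (rng.random(x.size) < 1.0 / (1.0 + np.exp(-1.5 * x))).astype(float)
fisher_empirical = np.mean(((y - p1) * x) ** 2)

print(fisher_model, fisher_empirical)
```

When the model is well specified ($p = p_\theta$) the two quantities coincide in expectation; under misspecification, as above, they generally do not.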

Questions

  • Could someone help me understand the meaning of this second definition?
  • Do you know where, and by whom, this definition was introduced?

Thanks!