I'm struggling to understand what Fisher information is. Here is my thought process.
We begin with the PDF $f(x)$ of some distribution. Let's assume we are dealing with a continuous case on the real line. Then, in the framework of my statistics class, we say that our distribution also depends on some unknown parameter $\theta$. So, our PDF becomes a function of two arguments, $f(x,\theta)$.
We run an experiment and obtain i.i.d. results $X_1, X_2, \ldots, X_n$. We consider the joint PDF $g(\bar X,\theta)=f(X_1,\theta)f(X_2,\theta)\cdots f(X_n,\theta)$.
With the experiment result known, our function $g$ becomes a function of the single variable $\theta$ and is denoted by $g(\theta \mid \bar X)$. It is now called the likelihood function.
We are now interested in how informative this particular experiment result is. The quantity that measures this is the Fisher information, defined by:
$$I(\theta)=E_{\theta}\left[\left(\frac{d}{d\theta}\ln g(\theta \mid \bar X)\right)^2\right].$$
The first thing I don't understand is the role of the function $\ln$ here. Taking the definition as given, my further understanding is the following:
We consider the graph of the likelihood function $g(\theta)$, and then the slope of this function. The bigger the slope, the bigger the difference between likelihood values at nearby values of $\theta$, so we can locate $\theta$ more precisely (hence, "more information"). Now, we don't know the exact value of $\theta$, so we would like to average that slope over the whole domain of $\theta$. We also have to account for the fact that different values of $\theta$ can appear with different probabilities. The expected value $E$ is then exactly what we need.
We arrive at this: $E_{\theta}\left[\frac{d}{d\theta}\ln g(\theta \mid \bar X)\right]=\int_{\Theta}\frac{d}{dt}\ln g(t \mid \bar X)\,\theta(t)\,dt$, where $\theta(t)$ is the PDF of the parameter $\theta$ and $\Theta$ is the domain of $\theta$. Now, this by itself is not very representative, since negative and positive slopes can cancel out, so we square the derivative and arrive at the definition.
Now, I don't understand how $I$ is a function and not a number (depending on our experiment result): after we've taken the integral, $\theta$ must be gone. And I also don't understand why, in the problems I am given in class, I am given the distribution of the $X_i$ and not of the parameter $\theta$.
What are the mistakes and missing points in my reasoning?
Under certain regularity conditions, there is an equivalent form of the Fisher information.
$$I(\theta) = E_\theta \left[\left(\frac{d}{d\theta} \ln g(\theta \mid X)\right)^2\right] = -E_\theta \left[\frac{d^2}{d\theta^2} \ln g(\theta \mid X)\right]$$
This form involving the second derivative has an intuitive interpretation: it captures the curvature of the log-likelihood function. High Fisher information at some $\theta$ indicates that the log-likelihood is sharply peaked there. I think your attempt to interpret the other form (the square of the first derivative) is fine too.
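As a quick sanity check of the two forms (my own toy setup, not from the question): for a single observation $X \sim N(\theta, 1)$, the score is $x - \theta$, the second derivative of the log-density is the constant $-1$, and both forms give $I(\theta) = 1$. A Monte Carlo estimate of the squared-score form should land near that value:

```python
import numpy as np

# Hypothetical numerical check: X ~ N(theta, 1), for which I(theta) = 1.
rng = np.random.default_rng(0)
theta = 2.0
x = rng.normal(theta, 1.0, size=1_000_000)

# Score: d/dtheta ln f(x; theta) = x - theta for the unit-variance normal.
score = x - theta
form_sq = np.mean(score**2)  # Monte Carlo estimate of E[(d/dtheta ln f)^2]

# Curvature form: d^2/dtheta^2 ln f(x; theta) = -1 identically,
# so -E[d^2/dtheta^2 ln f] = 1 exactly for this model.
form_curv = 1.0

print(form_sq, form_curv)  # form_sq is approximately 1
```

The agreement here is special to the normal model only in that the curvature is non-random; in general both sides are genuine expectations over the data.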
Regarding the logarithm, Wikipedia notes that some sources use a different definition where the logarithm is absent. I haven't encountered this myself and would not bother with that alternate definition. Basing your interpretation on the shape of the log-likelihood, as you have attempted to do, is fine. As for why we usually consider the log-likelihood instead of the likelihood, it is because this quantity shows up naturally, most notably in the Cramér-Rao lower bound.
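For reference, the bound itself: under the same regularity conditions, any unbiased estimator $\hat\theta$ of $\theta$ satisfies
$$\operatorname{Var}_\theta(\hat\theta) \ge \frac{1}{I(\theta)},$$
so high Fisher information puts a low floor on the achievable variance, which matches the "sharply peaked log-likelihood" picture above.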
$I$ is still a function of $\theta$. The expectation is taken with respect to the randomness in the data $X$, not over $\theta$: in this classical setup, $\theta$ is a fixed unknown constant, not a random variable, so it has no PDF, and your $\theta(t)$ is where the reasoning goes astray. The integral defining the expectation runs over the sample space, against the density of the data (one usually writes it as $f_\theta$ or something similar), which is also why your class problems give you the distribution of the $X_i$ and not of $\theta$. After taking that integral, the quantity you obtain still depends on $\theta$, since the density depends on $\theta$.
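To make "a function of $\theta$, not a number" concrete, here is a small illustration of my own (not from your class problems): for a single Bernoulli$(p)$ observation, $I(p) = 1/\bigl(p(1-p)\bigr)$. Estimating the expectation by Monte Carlo at several fixed values of $p$ shows the expectation is over the data $X$, yet the answer still varies with $p$:

```python
import numpy as np

# Hypothetical example: one Bernoulli(p) observation, I(p) = 1/(p(1-p)).
# At each fixed p we average over simulated data X; the result depends on p.
rng = np.random.default_rng(1)

def fisher_info_mc(p, n=1_000_000):
    x = rng.binomial(1, p, size=n)
    # Score: d/dp ln[p^x (1-p)^(1-x)] = x/p - (1-x)/(1-p)
    score = x / p - (1 - x) / (1 - p)
    return np.mean(score**2)  # Monte Carlo estimate of E_p[score^2]

for p in (0.2, 0.5, 0.8):
    print(p, fisher_info_mc(p), 1 / (p * (1 - p)))
```

Each printed pair matches the closed form $1/(p(1-p))$: information is lowest at $p = 0.5$ and grows as $p$ approaches $0$ or $1$.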