I'm confused about two things. For the set up, I am told that
The likelihood function is $L(\theta)=\prod_{i=1}^n f(y_i;\theta)$ for a distribution with pdf $f(y;\theta)$.
The log-likelihood function is $l(\theta)=\log L(\theta)$.
The score function is $U=\frac{dl}{d\theta}$.
These are the notations used below.
Okay, sure. Definitions for convenience, I'm happy with that. But now I am looking at a proof that $\mathbb{E}(U)=0$. It does something bizarre; namely, the score "function" becomes a "random variable with pdf $f(y;\theta)$." Question is, why?
We have $\frac{dl}{d\theta}=\sum_{i=1}^n \frac{d \log f(y_i;\theta)}{d\theta}$ (by using the property of the log; I'm good up to here). For a particular random variable $Y:=Y_i$, i.e. for some particular $i$, we view
$$U=\frac{d \log f(Y;\theta)}{d\theta}$$
as a random variable with the distribution determined by the distribution of $Y$. (... huh?)
Nope, lost it. So, okay, we somehow came up with a new random variable $U$... but how does this new variable have the same distribution (i.e. $f$) as the $Y_i$s? Can someone give an elaborate explanation of that? I ask because the proof then proceeds to
$$\mathbb{E}(U)=\int \frac{d \log f(y;\theta)}{d\theta}\, f(y;\theta)\,dy$$
which, pattern-matching against $\int x f(x)\,dx$, tells me that $\frac{d \log f(y;\theta)}{d\theta}$ is being treated as the variable (as mentioned) with pdf $f$. I don't see why and how I can reason that it has pdf $f$.
Second, it then proceeds with computation and arrives at
$$\mathbb{E}(U)=\int \frac{d}{d\theta}f(y;\theta)\,dy= \frac{d}{d\theta} \int f(y;\theta)\,dy$$
which I don't understand why it becomes,
$$\frac{d}{d\theta} \int f(y;\theta)\,dy=\frac{d}{d\theta}(1)$$
How is $\int f(y;\theta)\,dy=1$? I mean, say if $f(y;\theta)=\theta y$, then clearly $\int f(y;\theta)\,dy=\frac{\theta}{2}y^2$, and that's already a counterexample. Well, I get in trouble if I have to integrate from $-\infty$ to $\infty$, but still, I don't see why it becomes $1$.
Can someone answer the two questions?
---
**Update:** Ignore my second question, I figured it out. It's because $f$ is a pdf, isn't it?
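As a quick numerical sanity check of that last point (the exponential pdf $f(y;\theta)=\theta e^{-\theta y}$ here is just an assumed example, not the one from the proof):

```python
import math

# Assumed example: exponential pdf f(y; theta) = theta * exp(-theta * y), y >= 0.
theta = 2.0

def f(y, theta):
    return theta * math.exp(-theta * y)

# Left Riemann sum over [0, 20); the tail beyond 20 is negligible for theta = 2.
dy = 1e-4
n_steps = 200_000  # covers [0, 20)
total = sum(f(k * dy, theta) * dy for k in range(n_steps))
print(total)  # close to 1.0: a pdf integrates to 1 over its whole support
```

The definite integral over the full support is what equals $1$; an indefinite integral like $\frac{\theta}{2}y^2$ is a different object.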
You got the second part. Now for the first:
$U$ is the slope of the log-likelihood function. Now here I think there needs to be some clarification. Given a probability model $f(y;\theta)$ and a sample $y:= (y_1,y_2,\dots,y_n)$, we get a log-likelihood function $l(\theta)$ of the parameter being inferred.
Therefore, we really should write $U(\theta)=\frac{dl}{d\theta}$.
With that aside, let's focus on what it means for $E(U(\theta))=0$. First, note that we are evaluating the score at the true value of $\theta$ (i.e., the value from the distribution that generated the $y_i$). In this situation, it would have been clearer for the writer to have used:
$$E_{\theta} [U(y;\theta)] = 0 $$
This notation is clearer because it shows that the score is a function of both the data and the parameter it is being evaluated at, AND the parameter of the data-generating distribution (i.e., the true parameter).
Normally, we think of the likelihood as a value, but, of course, it's also a sample statistic. Same goes for the score.
The way I've written the expectation above highlights that the score is a function of a random sample $y$ and a fixed parameter at which we are evaluating the function (the "argument" of the function). Going back to basic probability theory:
$$X\sim f(x) \implies E[g(X)] = \int g(x)f(x)dx$$
This applies to any function $g(\cdot)$ of a random variable, as long as it is reasonably smooth (no Weierstrass functions!), such as those that are continuous and differentiable or can be integrated over a counting measure.
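A quick Monte Carlo sketch of this identity; the exponential distribution and $g(x)=x^2$ are assumed examples (for $X\sim\text{Exp}(\theta)$, the integral evaluates to $2/\theta^2$):

```python
import random

random.seed(0)
theta = 2.0
n = 200_000

# X ~ Exponential(rate = theta); g(x) = x^2 is an arbitrary smooth function.
samples = [random.expovariate(theta) for _ in range(n)]
mc_estimate = sum(x ** 2 for x in samples) / n  # sample mean of g(X)
exact = 2 / theta ** 2                          # integral of x^2 * theta*exp(-theta*x) dx

print(mc_estimate, exact)  # the two agree up to Monte Carlo error
```

The point is that we never need the distribution of $g(X)$ itself; averaging $g$ against the pdf of $X$ is enough, and the same trick is what is applied to the score.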
Given this formula for the expected value of a function of random variables, we can see that $U(y;\theta)$ is a function of a random vector, and hence its expected value is calculated using the pdf of the random vector $Y$:
$$E_{\theta}[U(Y;\theta)]=\int U(y;\theta)f(y;\theta)\, dy$$
If $Y$ is truly a vector (i.e., $n>1$), then the above is a multiple integral.
Now, we usually assume we have an iid sample, in which case the score of the sample is a sum of per-observation scores, each with the same distribution, so the multiple integral collapses into $n$ copies of a one-dimensional integral:
$$E_{\theta}[U(Y;\theta)] = \sum_{i=1}^n E_{\theta}\!\left[\frac{d \log f(Y_i;\theta)}{d\theta}\right] = n \int \frac{d \log f(y;\theta)}{d\theta}\, f(y;\theta)\, dy$$
At which point the rest of the argument follows.
So, the first key point to keep in mind is that the score function is a function of random variables, and therefore its mean is taken with respect to the pdf of those variables. The second is that when they are iid and we evaluate the score function at the true value (the one that generated the $y$), we expect it to be zero. This makes sense since the slope of the log-likelihood at its maximum is $0$, and we expect the likelihood at the true $\theta$ to attain a local maximum (on average).
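To make this concrete, here's a small numerical sketch. The exponential model is an assumed example: for a single observation $Y\sim\text{Exp}(\theta)$ we have $\log f(y;\theta)=\log\theta-\theta y$, so $U(y;\theta)=1/\theta - y$, whose mean is zero exactly when $\theta$ is the true rate:

```python
import random

random.seed(1)
theta_true = 1.5
n = 500_000

# Score of one exponential observation: d/dtheta [log(theta) - theta*y] = 1/theta - y.
def score(y, theta):
    return 1 / theta - y

ys = [random.expovariate(theta_true) for _ in range(n)]
mean_at_true = sum(score(y, theta_true) for y in ys) / n
mean_at_wrong = sum(score(y, 2.5) for y in ys) / n

print(mean_at_true)   # close to 0: E[U] = 0 at the true theta
print(mean_at_wrong)  # clearly nonzero when the score is evaluated at a wrong theta
```

Evaluating the score at a wrong $\theta$ gives a nonzero mean, which is exactly why the qualifier "at the true value" matters.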