I know that from measure-theoretic probability, $E(Y\mid X)$ and $E(Y\mid X=x)$ are different in nature: the former is "conditional on a random variable" and the latter is "conditional on an event" (let's assume it is a null event here). But I am still not sure about a few things:
- When are the two equivalent, i.e., when does one imply the other?
- If I specify $E(Y\mid X)=X$ and $E(Y\mid X=x)=x$, are the two equations equivalent, i.e., does one imply the other?
- When discussing statistical models such as linear regression, we often write $E(Y\mid X)=X\beta$. In this case, are we "conditioning on a random variable" or "conditioning on an event"? (This question is trivial if the answer to question 2 is yes.)
The expression $X=x$ identifies a particular event, i.e. a particular subset of a probability space. One conditions on events: one speaks of conditional probabilities given a particular event, hence of conditional probability distributions given an event, and hence of conditional expected values given an event.
The conditional expected value $\operatorname E(Y\mid X=x)$ is a conditional expected value given an event. What number it is depends on what number $x$ is. So it's a function of $x.$ Call it $g(x).$ We have $\operatorname E(Y\mid X=x) = g(x).$
Then $\operatorname E(Y\mid X)$ is the random variable $g(X).$
Consequently $\operatorname E(Y\mid X)=X$ holds (almost surely) if and only if $\operatorname E(Y\mid X=x) = x$ holds for (almost) all values of $x.$
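This relationship can be checked numerically. The following sketch uses a made-up example (a discrete $X$ and $Y = X + \text{noise}$, so that $g(x)=x$ by construction): it estimates $\operatorname E(Y\mid X=x)$ by averaging $Y$ over the event $\{X=x\},$ then forms the random variable $g(X)$ by plugging $X$ back into that function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# X takes the values 0, 1, 2 with equal probability; Y = X + mean-zero noise,
# so that E(Y | X = x) = x and hence E(Y | X) = X.
X = rng.integers(0, 3, size=n)
Y = X + rng.normal(0.0, 1.0, size=n)

# Estimate E(Y | X = x) = g(x) by averaging Y over the event {X = x}.
cond_means = {x: Y[X == x].mean() for x in (0, 1, 2)}

# The random variable E(Y | X) is then g(X): apply g to each realization of X.
g_of_X = np.vectorize(cond_means.get)(X)
```

Each estimated conditional mean comes out close to $x$ itself, consistent with $\operatorname E(Y\mid X)=X$ for this construction.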
In linear regression, one typically has the following: $Y$ is an $n\times 1$ column vector, $X$ is an $n\times p$ matrix, $\beta$ is a $p\times1$ column vector, and $\operatorname E(Y\mid X) = X\beta.$
Often it is written as $\operatorname E(Y) = X\beta,$ and neither $X$ nor $\beta$ is treated as random. What is random is the "errors", so one has $Y=X\beta+\varepsilon,$ where $\varepsilon$ is a random $n\times 1$ column vector whose expected value is $0,$ i.e. an $n\times1$ column of $0$s.

In some statistical problems $X$ is fixed by design: the experimenter is able to choose the value of the matrix $X.$ In other problems, the experimenter cannot choose $X,$ but every time a new sample of $n$ observations is taken, $X$ remains the same and $\beta$ remains the same, so only $\varepsilon,$ and hence $Y,$ changes. In that case, what one conditions on is neither a random variable nor an event, but rather a parameter that determines the probability distribution of $Y.$

In all regression problems I know of, $X$ is part of the observable data (as is $Y$), while $\beta$ is unobservable and is to be estimated from the observed $X$ and $Y.$ The estimate $\widehat\beta$ then becomes a random variable that one expresses as a function of $X$ and $Y.$
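As a concrete sketch of this setup (with made-up dimensions and a made-up true $\beta$), the following generates data from $Y = X\beta + \varepsilon$ and computes the least-squares estimate $\widehat\beta$ as a function of the observed $X$ and $Y$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design: n = 100 observations, p = 2 columns (intercept + one regressor).
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, size=n)])

beta = np.array([2.0, 0.5])           # the unobservable parameter
eps = rng.normal(0.0, 1.0, size=n)    # random errors with expected value 0
Y = X @ beta + eps                    # the model Y = X beta + eps

# The least-squares estimate beta_hat is a function of the observed X and Y.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Whether $X$ was chosen by the experimenter or merely observed, the arithmetic of $\widehat\beta$ is the same; what differs is the interpretation discussed above.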
But it is also often the case that whenever a new sample of $n$ observations is taken, both $X$ and $Y$ change. In that case, $X$ is a random variable and $\beta$ is not. However, in estimating $\beta$ by least squares in such problems, $X$ is in effect treated as not random, and the justification of that is that one is conditioning on $X.$
Sometimes one assigns a prior probability distribution to $\beta,$ not because $\beta$ is random in the sense of being something that changes each time a new sample of $n$ observations is taken, but because the value of $\beta$ is uncertain. In that case, instead of using least squares or any of its relatives, one multiplies the likelihood function (a pointwise defined function of $\beta$) by the prior probability measure on $\beta,$ and then normalizes, to get the posterior probability distribution of $\beta.$
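A minimal sketch of this, under assumptions not in the original answer (a normal prior $\beta\sim N(0,\tau^2 I)$ and known error variance $\sigma^2$): with a normal likelihood this prior is conjugate, so the "multiply by the likelihood and normalize" step has a closed form, and the posterior is again normal.

```python
import numpy as np

rng = np.random.default_rng(2)

# Same hypothetical regression setup: Y = X beta + eps, known noise variance sigma2.
n, p, sigma2 = 100, 2, 1.0
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, size=n)])
beta_true = np.array([2.0, 0.5])
Y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2), size=n)

# Normal prior beta ~ N(0, tau2 * I).  Conjugacy gives the posterior directly:
# posterior covariance = (X'X / sigma2 + prior precision)^(-1),
# posterior mean       = posterior covariance @ (X'Y / sigma2).
tau2 = 100.0
prior_precision = np.eye(p) / tau2
post_cov = np.linalg.inv(X.T @ X / sigma2 + prior_precision)
post_mean = post_cov @ (X.T @ Y / sigma2)
```

With a diffuse prior (large $\tau^2$), the posterior mean is close to the least-squares estimate, which is one way to see the connection between the two approaches.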