In Cover and Thomas' "Elements of Information Theory", the joint entropy $H(X,Y)$ is defined, but the authors remark that this definition is nothing new if we view it as the entropy of a single vector-valued random variable $(X,Y)$. The book then goes on to define conditional entropy separately, and the earlier remark got me wondering whether this too could just be the entropy of a random variable $X|Y$.
But is $X|Y$ really a random variable? It seems that much of the time, when we talk about random variables, we use them to state facts about their associated probability distributions. This is interesting because (correct me if I am wrong) formally they are just functions from $\Omega$ to $\mathbb{R}^n$ and they carry no information about the distribution. The entropy of a random variable is really the entropy of a PMF that we associate with that random variable in our heads.
That leads me to believe that $X|Y = X$ formally, in the sense of functions, and the only distinction is that we "change the PMF of $X$". Am I making any sense or is this interpretation wrong?
If I am correct, an additional question would be, why is it common practice to stretch the definitions in this way and talk about RVs so freely when the actual object of interest is the distribution?
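To make the "change the PMF of $X$" idea concrete, here is a small sketch (the joint PMF is a made-up toy example): for each value $y$, the object $X \mid Y = y$ is just $X$ with the re-weighted PMF $p(x \mid y)$, and $H(X|Y)$ is the $p(y)$-weighted average of the entropies of those re-weighted PMFs.

```python
import math

# Hypothetical joint PMF p(x, y) on {0,1} x {0,1}, chosen only for illustration.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def entropy(pmf):
    """Shannon entropy (in bits) of a PMF given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# Marginal of Y.
p_y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

# "Changing the PMF of X": for each y, X | Y=y is X under the re-weighted PMF p(x|y).
def conditional_pmf(y):
    return {x: joint[(x, y)] / p_y[y] for x in (0, 1)}

# H(X|Y) is the p(y)-weighted average entropy of these re-weighted PMFs.
H_X_given_Y = sum(p_y[y] * entropy(conditional_pmf(y)) for y in (0, 1))
print(H_X_given_Y)
```

So nothing beyond ordinary entropies of ordinary PMFs is involved, which seems to support the interpretation above.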
$X|Y$ is not a RV, but $E(X|Y)$ is!
But note that conditional entropy is not the entropy of $E(X|Y)$; it is defined as $H(X|Y) = H(X,Y) - H(Y) = \sum_y p(y)\, H(X \mid Y = y)$. For example, if $X,Y$ are independent, then conditioning on $Y$ does not change the PMF of $X$, so $H(X \mid Y = y) = H(X)$ for every $y$, and hence $H(X|Y) = H(X)$.
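A quick numerical check of the chain rule $H(X|Y) = H(X,Y) - H(Y)$ in the independent case (the marginals below are arbitrary toy choices):

```python
import math

def entropy(pmf):
    """Shannon entropy (in bits) of a PMF given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# Independent X and Y: the joint PMF factors as p(x, y) = p(x) * p(y).
p_x = {0: 0.25, 1: 0.75}
p_y = {0: 0.5, 1: 0.5}
joint = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

# Chain rule: H(X|Y) = H(X,Y) - H(Y); under independence this equals H(X).
H_XY = entropy(joint)
H_Y = entropy(p_y)
H_X = entropy(p_x)
print(H_XY - H_Y, H_X)  # the two values agree when X and Y are independent
```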
"formally they are just functions from $\Omega$ to $R^n$ and they don't carry information about the distribution." not completely correct, because $\Omega$ has to be endowed with a measure. In fact, the exact $\Omega$ does not really matter as the same RV can be defined in many possible ways on many sample spaces.
What $E(X|Y)$ does is reduce the "granularity" of subsets of $\Omega$. For example, for the RV $X$, $\omega_1$ and $\omega_2$ might lead to different values, but for $E(X|Y)$ they lead to the same value.
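Here is a minimal sketch of this coarsening, on a made-up four-point sample space with the uniform measure: $X$ separates all four outcomes, but $E(X|Y)$, viewed as a function of $\omega$, is constant on each level set of $Y$.

```python
# Toy sample space Omega = {0, 1, 2, 3} with uniform measure (an assumption for illustration).
omega = [0, 1, 2, 3]
prob = {w: 0.25 for w in omega}

X = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}   # X separates all four outcomes
Y = {0: 0, 1: 0, 2: 1, 3: 1}           # Y only distinguishes two "blocks" of Omega

def cond_exp(w):
    """E(X|Y) at omega = w: average X over the Y-block containing w."""
    block = [v for v in omega if Y[v] == Y[w]]
    total = sum(prob[v] for v in block)
    return sum(prob[v] * X[v] for v in block) / total

values = {w: cond_exp(w) for w in omega}
print(values)  # omega = 0 and omega = 1 map to the same value, as do 2 and 3
```

So $E(X|Y)$ is a genuine random variable on the same $\Omega$, just one that can no longer tell $\omega_1$ and $\omega_2$ apart.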