Intuitively, what are the differences between $X$, $\mathbb{E}(X)$ and $\mathbb{E}(X\mid\mathcal{F})$?
The measure-theoretic approach to probability has been rather confusing. Prior to learning about measure theory in probability, I always learnt that
i) $X$ is a random variable, and ii) $\mathbb{E}(X)$ is a constant obtained by "averaging" over all events in the sample space. Then, in the measure theory course, I learnt that i) $X$ is a measurable function, i.e. the pre-image $X^{-1}(A)=\{\omega\in\Omega:X(\omega)\in A\}$ lies in $\mathcal{F}$ for every measurable set $A$; ii) if $X$ is $\mathcal{F}$-measurable, then $\mathbb{E}(X\mid\mathcal{F})=X$. I also interpret $\mathcal{F}$ as the "information we currently have" (I know that mathematically it is a $\sigma$-algebra on $\Omega$).
Therefore, to ensure that I have the right intuition for the three quantities $X$, $\mathbb{E}(X)$ and $\mathbb{E}(X\mid\mathcal{F})$, i) which of these are random variables and which are constants; and ii) how do I interpret the differences between each one (particularly between $\mathbb{E}(X)$ and $\mathbb{E}(X\mid\mathcal{F})$)?
Those definitions, and what they mean intuitively, can indeed be confusing when you are new to the subject.
The first point to make here is that one should view all those expectations as functions on the underlying sample space $\Omega$. There is a hidden argument $\omega \in \Omega$ in all of them, so we really have $$ X=X(\omega), \quad E[X] = E[X](\omega), \quad E[X|\mathcal{F}] = E[X|\mathcal{F}](\omega). $$ All of these should be seen as the best possible approximation of the function $X(\omega)$ given the information currently at hand. To make this precise we need the concept of a filtration of $\sigma$-algebras.
The random variable $X(\omega)$ itself is really nothing more than a function on the sample space, measurable with respect to some $\sigma$-algebra $\mathcal{F}$. It maps points $\omega$ in the sample space to values.
The collection of sets in $\mathcal{F}$ should be seen as representing all those events (collections of sample points) that you can separate given the possible values of $X(\omega)$.
To make this clear, let us make a very simple example:
Let $X(\omega)$ be defined on the discrete set $\Omega = \{ 1,2,3,4\}$ with values in $\mathbb{R}$, where $X(1) = 10, X(2) = 10$, $X(3)= 20, X(4) = 30$. We can see this $X$ as random draws from $\Omega$, but there is no need to introduce any probability measure at this point.
The usual Borel $\sigma$-algebra on $\mathbb{R}$ now induces a $\sigma$-algebra on $\Omega$: the preimage under $X$ of any Borel set in $\mathbb{R}$ is precisely one of the sets in $$ \mathcal{F} = \{\emptyset, \,\Omega, \{1, 2 \}, \{ 3\}, \{ 4\}, \{ 3, 4\}, \{1,2,3 \}, \{1, 2,4 \}\}. $$ If you are only allowed to observe the value of $X(\omega)$, but not $\omega$ itself, you will never (regardless of which outcome you observe) be able to answer the question: did the random draw result in a $1$ or a $2$?
If the value observed is $20$, you know $\omega$ was neither $1$ nor $2$. If the value observed is $10$, you know only that $\omega$ is either $1$ or $2$.
When one says that the $\sigma$-algebra $\mathcal{F}$ does not contain the information needed to separate the sets $\{1\}$ and $\{2\}$, this is what it means: those sets are not in the $\sigma$-algebra, and hence cannot be distinguished by looking at the values of $X$ on them, since $X$ takes the same value at every $\omega$ in the set: $X(1) = X(2)$.
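The induced $\sigma$-algebra above can be enumerated mechanically. Here is a small Python sketch (the variable names are mine, purely for illustration) that builds the atoms $X^{-1}(\{v\})$ for each value $v$ of $X$ and takes all unions of them:

```python
from itertools import combinations

# The discrete example: X maps sample points in Omega = {1,2,3,4} to values.
X = {1: 10, 2: 10, 3: 20, 4: 30}

# Preimages of the distinct values of X partition Omega into "atoms".
atoms = {}
for omega, value in X.items():
    atoms.setdefault(value, set()).add(omega)
atoms = list(atoms.values())  # [{1, 2}, {3}, {4}]

# The sigma-algebra generated by X consists of all unions of atoms.
sigma_algebra = []
for r in range(len(atoms) + 1):
    for combo in combinations(atoms, r):
        sigma_algebra.append(frozenset().union(*combo))

print(sorted(sigma_algebra, key=lambda s: (len(s), sorted(s))))
# 2^3 = 8 sets, matching F above; note {1} and {2} are not among them.
```

Three atoms give $2^3 = 8$ unions, exactly the eight sets of $\mathcal{F}$ listed above, and $\{1\}$, $\{2\}$ never appear.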
We can see $X(\omega)$ as the outcome of an experiment resulting in the sample $\omega$. Once the experiment is done, everything about it is known: we know the function $X$, we know the sample $\omega$, and we can calculate $X(\omega)$.
Here it is helpful to introduce a time dimension. Let us assume that we fully know $X$, and that the full outcome $\omega$ of the experiment is revealed at a time $T>0$. The $\sigma$-algebra considered so far is then $\mathcal{F}_T := \mathcal{F}$. Let us also assume we have a probability measure $\mathbb{P}$ describing the probabilities of the different outcomes.
We now have a full probability space $(\Omega, \mathcal{F}_T, \mathbb{P})$. Here, by definition we have $$ X(\omega) = E[X|\mathcal{F}_T](\omega). $$ $X$ is fully known given the information in the $\sigma$-algebra $\mathcal{F}_T$.
Assume now that we are at an earlier point in time, when the full outcome of the experiment is not yet known. For example, let us start at time $0$, and assume nothing about the outcome is known at that point. We know the experiment can result in the values $10$, $20$ or $30$, and we know their probabilities, but so far we have no information about the actual outcome.
At this point in time we cannot differentiate between any of the nontrivial sets in $\mathcal{F}$. Given such a set, we cannot answer the question whether the sample $\omega$ lies in it or not, since nothing about $\omega$ is known yet.
We now want to make the best possible approximation of the function $X(\omega)$ at time $0$. Since nothing is known, the best we can do is to approximate it by the constant function equal to its "usual" expected value $$ E[X](\omega) = \sum_i X(\omega_i) \cdot \mathbb{P}(\{\omega_i\}). $$ This function does not depend on $\omega$ and is therefore constant: we have taken the probability-weighted average over all possible outcomes, since we cannot yet differentiate between any of them. That means that the natural $\sigma$-algebra for the expected-value function here (i.e. the smallest one with respect to which this constant function is measurable) is $\mathcal{F}_0 = \{ \emptyset, \Omega\}$.
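As a numerical sanity check of this formula — assuming, purely for illustration, the uniform measure $\mathbb{P}(\{\omega\}) = 1/4$ on $\Omega$ (the example above leaves $\mathbb{P}$ unspecified):

```python
# A minimal sketch of E[X] for the example, assuming (for illustration only)
# the uniform measure P({omega}) = 1/4 on Omega = {1, 2, 3, 4}.
X = {1: 10, 2: 10, 3: 20, 4: 30}
P = {omega: 0.25 for omega in X}

# E[X] = sum over outcomes of X(omega) * P({omega}): a single constant,
# the same for every omega -- it uses no information about the outcome.
EX = sum(X[w] * P[w] for w in X)
print(EX)  # 17.5 under the assumed uniform measure
```

Whatever $\omega$ turns out to be, this approximation returns the same number.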
By the way, regarding notation, we have by definition that $$ E[X] = E[X](\omega) = E[X| \mathcal{F}_0](\omega) = \textrm{ A constant} $$
I want to make a remark here about something that can be a bit confusing. At time $t=0$ it would be natural to regard the outcome $\omega$ as the unknown thing, not the function $X$. Mathematically, however, it is represented somewhat the other way around: in the calculations we treat $\omega$ as a fully known entity, but the expected-value function at a given time can only "see" part of the information carried by $\omega$ (which at time $0$ is nothing at all).
Now assume we are at some point $t$ in time between $0$ and $T$ where we know a little bit about the possible outcomes of the experiment, but not all there is to know. Let us say, that at that time we know if the experiment had an outcome strictly smaller than $30$ or not.
That means we know whether $\omega \in \{1,2,3\}$ or $\omega \in \{4\}$: for each of those two sets, we can say for sure whether $\omega$ belongs to it or not.
This is not the case for sets like $\{2\}$ or $\{3,4\}$. If we know the outcome was smaller than $30$, then $\omega$ could be in the set $\{2\}$, but we cannot tell. The same holds for the other set: if the outcome was smaller than $30$, then $\omega$ could belong to $\{3,4\}$ (namely if $\omega = 3$), but we don't know.
The $\sigma$-algebra of sets we can separate given the information at that time is $$ \mathcal{F}_t = \{ \emptyset, \Omega, \{1,2,3\}, \{ 4\}\}. $$ We clearly have $$ \mathcal{F}_0 \subset \mathcal{F}_t \subset \mathcal{F}_T. $$ Now we want to calculate the expected value of $X$ at time $t$ given this information. This is still a function on $\Omega$, but one that is measurable only with respect to the $\sigma$-algebra $\mathcal{F}_t$. We cannot separate the points within the sets of $\mathcal{F}_t$ based on our current information. Therefore the expected value of $X$ given the information must be constant on each of these sets, and the best approximation we can make on each set is the expected value of $X$ given that the sample $\omega$ belongs to it:
$$ \begin{eqnarray*} E[X | \mathcal{F}_t](\omega) &=& \frac{1}{\mathbb{P}(\{ 1,2,3\})} \left( X(1) \cdot \mathbb{P}(\{ 1\}) + X(2) \cdot \mathbb{P}(\{ 2\}) + X(3) \cdot \mathbb{P}(\{ 3\})\right) \textrm{ if } \omega \in \{ 1,2,3\},\\ E[X | \mathcal{F}_t](\omega) &=& \frac{1}{\mathbb{P}(\{ 4\})} \left( X(4) \cdot \mathbb{P}(\{ 4\}) \right) = X(4) \textrm{ if } \omega \in \{ 4\}. \end{eqnarray*} $$ So $E[X | \mathcal{F}_t](\omega)$ is here a function taking two different values. This is our best approximation of $X$ given the information described.
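The two-valued function above is easy to compute. A sketch, again assuming (for illustration only) the uniform measure $\mathbb{P}(\{\omega\}) = 1/4$, which the example does not fix:

```python
# Sketch of E[X | F_t] for the example, assuming the uniform measure.
X = {1: 10, 2: 10, 3: 20, 4: 30}
P = {omega: 0.25 for omega in X}

# F_t is generated by the partition {1,2,3} | {4}: on each block, the
# conditional expectation is the probability-weighted average over it.
partition = [{1, 2, 3}, {4}]

def cond_exp(omega):
    block = next(B for B in partition if omega in B)
    pB = sum(P[w] for w in block)               # P of the block
    return sum(X[w] * P[w] for w in block) / pB  # average of X on the block

for omega in X:
    print(omega, cond_exp(omega))
# omegas 1, 2, 3 all map to (10+10+20)/3 = 13.33...; omega 4 maps to 30.
```

Note that `cond_exp` is constant on each block of the partition, i.e. it is $\mathcal{F}_t$-measurable, exactly as required.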
In general we have a whole so-called filtration of $\sigma$-algebras $$ \mathcal{F}_0 \subset \mathcal{F}_s \subset \mathcal{F}_T, \quad 0 < s < T, $$ where the $\sigma$-algebras are indexed by a variable $s$ that one can see as time, with the information increasing as time goes on.
So each of the functions $E[X | \mathcal{F}_s](\omega)$ can be seen as the best possible approximation of $X$ given the information at a certain time, starting with zero information (a constant function) and ending with full information (the full $X(\omega) = E[X | \mathcal{F}_T](\omega)$).
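One consistency check worth running on the example: averaging the finer approximation $E[X|\mathcal{F}_t]$ over $\mathbb{P}$ reproduces the constant $E[X]$ (the tower property of conditional expectation). A sketch, once more under the assumed uniform measure, which is my choice and not part of the example:

```python
# Numerical check that E[ E[X | F_t] ] = E[X] for the example,
# under the assumed uniform measure P({omega}) = 1/4.
X = {1: 10, 2: 10, 3: 20, 4: 30}
P = {w: 0.25 for w in X}
partition = [{1, 2, 3}, {4}]  # the blocks generating F_t

def cond_exp(omega):
    B = next(b for b in partition if omega in b)
    return sum(X[w] * P[w] for w in B) / sum(P[w] for w in B)

EX = sum(X[w] * P[w] for w in X)
E_cond = sum(cond_exp(w) * P[w] for w in X)
print(EX, E_cond)  # both equal 17.5: the coarse average of the block averages
```

Coarsening the information can only replace fine averages by coarser averages of them, so the overall mean is preserved.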
In this example I have used a discrete variable to illustrate the principle. In the continuous case the sums are replaced by integrals, and the expected values become probability-weighted integral averages instead.