I am trying to understand how digital filters work and how to actually calculate their output numerically.
I have read that they are characterised by a transfer function $H(z)$, which corresponds to a linear constant-coefficient difference equation of the form
$$y[n]=\sum_{q=0}^M \beta_q\, x[n-q] - \sum_{p=1}^N \alpha_p\, y[n-p],$$
with $\beta_q$ and $\alpha_p$ depending on the desired frequency response.
Then, in order to get the frequency-filtered output at sample $n$ of a time series, does one simply need to substitute numbers for the other variables? And how are those coefficients obtained?
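To make my question concrete, here is how I currently imagine evaluating that difference equation sample by sample (the coefficients below are made up, since I do not know how real ones are obtained):

```python
# My attempt at evaluating the difference equation directly
# (coefficients are arbitrary placeholders, not a designed filter)
def difference_equation(x, beta, alpha):
    """y[n] = sum_q beta[q]*x[n-q] - sum_p alpha[p-1]*y[n-p]"""
    y = [0.0] * len(x)
    for n in range(len(x)):
        acc = sum(beta[q] * x[n - q]
                  for q in range(len(beta)) if n - q >= 0)
        acc -= sum(alpha[p - 1] * y[n - p]
                   for p in range(1, len(alpha) + 1) if n - p >= 0)
        y[n] = acc
    return y

# response to a unit impulse, with made-up coefficients
y = difference_equation([1.0, 0.0, 0.0, 0.0], beta=[0.5, 0.5], alpha=[-0.2])
```

Is this, at least mechanically, the right idea?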
Finally, how do the Laplace transform, the $Z$-transform, and the bilinear transform relate to the subject?
The more I read, the more confused I get. It is silly, I admit, but I cannot intuitively understand how a linear equation can transform a signal by attenuating a range of frequencies while operating purely in the time domain.
The question is a bit too broad (much too broad, in fact): there are entire books dedicated to Digital Signal Processing. I will try to give you a general picture.
The guiding principle is the following: the same concepts that hold for continuous-time signals also apply to discrete-time signals.
The key difference, apart from some "unexpected" results and some details, is the language (i.e., the mathematical definitions) used to describe them. But the most important principles (e.g. sampling) still hold true. Unfortunately for you, this means that the answer to the question in the title (How do digital filters work in the time domain?) is "essentially the same as their analog counterparts" (which, I understand, does not take away the confusion, but keep reading).
Let's step back to the analog domain, $t \in \mathbb{R}$ (I hope you are already well-versed in it, otherwise it will get ugly).
For initially-at-rest SISO (Single Input, Single Output) systems, it is well known that, denoting by $x(t)$, $t \in \mathbb{R}$, the input signal, by $y(t)$ the "output" signal (see the note at the end), and by $h(t)$ the impulse response of the LTI (Linear Time-Invariant) system,
$$\textit{Time domain:} \quad y(t) = h(t)*x(t)$$
where $*$ denotes (linear) convolution. Denoting by $X(s)=\mathcal{L}[x(t)]$, $s \in \mathbb{C}$, the Laplace transform, it is also well known that
$$\textit{Laplace domain:} \quad Y(s) = H(s)X(s)$$
The frequency domain can be reached via the Fourier transform, or by evaluating $H(j\omega)=H(s)|_{s=j\omega}$, where $j=\sqrt{-1}$, from which the frequency response $H(f)$ can be obtained. (A note on notation: $\omega \in \mathbb{R}$ denotes the angular frequency, measured in rad/s, with $\omega = 2\pi f$, where $f$ is the frequency measured in hertz.)
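As a toy illustration (my own example, not from your question): take a first-order analog low-pass with $H(s) = 1/(1 + sRC)$; substituting $s = j\omega$ gives its frequency response, which can be evaluated numerically:

```python
import math

# First-order analog low-pass: H(s) = 1 / (1 + s*RC)
RC = 1.0  # time constant in seconds (arbitrary choice)

def H(omega):
    """Frequency response H(j*omega), obtained by setting s = j*omega."""
    s = 1j * omega
    return 1.0 / (1.0 + s * RC)

# At the cutoff angular frequency omega = 1/RC the magnitude is 1/sqrt(2) (-3 dB)
print(abs(H(1.0 / RC)))    # ≈ 0.7071
# Far above the cutoff the magnitude rolls off towards zero
print(abs(H(100.0 / RC)))  # ≈ 0.01
```

The point is that $H$ is a fixed algebraic object; "filtering" is just what its magnitude and phase do to each frequency component of the input.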
Apart from some details of no practical interest right now, the Laplace domain is informally (though not rigorously) called the "(complex) frequency domain" (as in the Wikipedia picture below).
The concept underlying these formulas is the famous time-domain/frequency-domain duality. Sometimes an image speaks more than words:
The key concept is that every operation in the time domain has a corresponding effect in the frequency domain, and vice versa. If you did not follow up to this point, I warn you: a solid background is essential when dealing with these concepts.
Your last sentence is slightly discouraging: how could it not? Whether the system is linear or not, discrete- or continuous-time, how could a transformation in time not have a frequency-domain counterpart? Analog filters, in reality, operate in the time domain (obviously), but are designed in the frequency domain.
I am not saying that time/frequency duality is obvious, but, mathematically speaking, it is simply using the Fourier transform to represent the same information in a different "format". Does the ideal sampling theorem ring a bell?
For discrete-time LTI SISO systems, the concepts are essentially the same, as mentioned above. The notation often uses $k \in \mathbb{Z}$ (or, as in your question, $n$) to denote discrete time.
The discrete-time counterpart of every operation must be defined and used; thus, instead of the usual (linear) convolution, the discrete convolution is used. The "equivalent" of the Laplace transform is the $Z$-transform (of complex variable $z \in \mathbb{C}$), which is used, among other things, to establish the stability of the system.
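For instance, a minimal sketch of the discrete convolution $y[k] = \sum_m h[m]\, x[k-m]$:

```python
def discrete_convolution(h, x):
    """y[k] = sum_m h[m] * x[k - m]; output length is len(h) + len(x) - 1."""
    y = [0.0] * (len(h) + len(x) - 1)
    for k in range(len(y)):
        for m in range(len(h)):
            if 0 <= k - m < len(x):
                y[k] += h[m] * x[k - m]
    return y

# Convolving with a unit impulse (Kronecker delta) returns h itself
print(discrete_convolution([1.0, 0.5, 0.25], [1.0]))  # [1.0, 0.5, 0.25]
```

This is exactly the time-domain operation that multiplication by $H(z)$ represents in the $Z$-domain.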
The "equivalent" of Fourier transform is the DTFT or, if you are working with $N$-sequences (i.e., digital signals), with $N$ finite, the DFT is used, in which case circular convolution is the respective counterpart.
Now (finally!), when $h[n]$ is a FIR (Finite Impulse Response), its coefficients are obtained by the methods used for FIR design, which depend on the goal of the filter itself. In other words: unless $y[n]$ is already known, it is not a matter of "simple substitution" of numbers. Far from it.
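To give one concrete example of such a design method (a sketch using SciPy's window-method routine `scipy.signal.firwin`; the tap count and cutoff below are arbitrary illustrative choices, not a recommendation):

```python
import numpy as np
from scipy.signal import firwin, lfilter

# Design a 31-tap low-pass FIR by the window method; the cutoff is
# normalized to the Nyquist frequency (0.2 means 0.2 * fs/2).
beta = firwin(numtaps=31, cutoff=0.2)

# Applying the filter is then exactly the difference equation
# y[n] = sum_q beta[q] * x[n-q]  (denominator [1.0]: a FIR has no feedback)
fs = 1000.0
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 300 * t)  # 10 Hz + 300 Hz
y = lfilter(beta, [1.0], x)  # the 300 Hz component is strongly attenuated
```

With $f_s = 1000$ Hz the normalized cutoff 0.2 corresponds to 100 Hz, so the 10 Hz component passes essentially unchanged while the 300 Hz component is suppressed: the design happened in the frequency domain, the computation happens in the time domain.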
However, when $y[n]$ is known, $h[n]$ is obtained from its definition, i.e., by setting the input signal equal to the unit impulse (the Kronecker delta) in the input/output model of the system or, if you are aware of it, by using the general formula derived exactly for these cases (which can be found in the literature).
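A sketch of that definition (the coefficients are mine, purely for illustration): feed a Kronecker delta through the input/output model and read off $h[n]$:

```python
import numpy as np
from scipy.signal import lfilter

# Example first-order IIR: y[n] = x[n] + 0.5*y[n-1]
# In SciPy's convention a[0]*y[n] + a[1]*y[n-1] + ... = b[0]*x[n] + ...
b = [1.0]
a = [1.0, -0.5]

delta = np.zeros(8)
delta[0] = 1.0            # Kronecker delta (unit impulse)
h = lfilter(b, a, delta)  # first 8 samples of the impulse response
# h[n] = 0.5**n: geometric decay, as expected for this recursion
```

For a stable IIR like this one, the impulse response is infinite but decays, so truncating it gives an arbitrarily good approximation.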
This is by no means an exhaustive answer. As is now apparent, your question touches a major topic in the theory of signals and systems; most probably, you are starting from the wrong viewpoint. I suggest you start from a basic book on the subject, which will also be useful as a reference in the future.
Note: The "output" signal of a LTI system which is of any practical interest is actually the forced evolution $v_f(t)$, $t \in \mathbb{R}$. The free evolution $v_l(t)$ is usually neglected. The same applies for discrete time systems.