What is the difference between response, output, hidden, and latent variables in modeling?


I'm learning machine learning and statistics and have some experience with both subjects. However, I still don't fully understand the fundamental difference between the following four kinds of variables:

  • Response variable
  • Output variable
  • Hidden variable
  • Latent variable

To my knowledge, the general task in data analysis is to find the functional relationship between the observed explanatory data $x$ and the variable of interest $y$. That is, in many modeling applications the goal is to find a function $f$ such that:

$$f(x)=y,$$

using some data set $D=\{(x_1,y_1), (x_2,y_2), \ldots, (x_n, y_n)\}$. I think the "response" and "output" variables are synonyms and generally refer to the $y$ variable.
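To make this setup concrete, here is a minimal sketch that fits a straight line $f(x)=a+bx$ to a small data set $D$ by ordinary least squares. The data (hour of day as $x$, temperature as $y$), the linear form of $f$, and the names `fit_line`, `a` and `b` are all assumptions made for illustration only.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y ≈ a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


# Hypothetical data set D: x = hour of day, y = temperature in °C
D = [(6, 11.0), (9, 14.0), (12, 17.0), (15, 20.0)]
xs = [x for x, _ in D]   # independent / explanatory variable
ys = [y for _, y in D]   # response / output / dependent variable
a, b = fit_line(xs, ys)


def f(x):
    """The fitted model: predicts the response y from the input x."""
    return a + b * x


print(a, b, f(18))
```

Here `xs` plays the role of the observed explanatory data, `ys` is the response, and `a`, `b` are the parameters of $f$ estimated from $D$.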

But what about "hidden" and "latent" variables? What do they refer to? Are they also synonyms for $y$ or do they refer to the parameters of $f$?

Concrete and simple examples would be both sufficient and excellent answers, thank you!

UPDATE:

As requested, I will also add the following into the list of variables to be explained:

  • Independent variable
  • Dependent variable
  • Confounding variable

There are 2 best solutions below

Best answer

A hidden variable is a variable that you cannot observe or measure directly when you sample from a process. For instance, this appears in the hidden Markov model (HMM), where you can sample the output data $(y_1,\ldots,y_n)$ but not the states $(x_1,\ldots,x_n)$ that led to the output.

Note that the distinction is between observable and hidden variables. The observable variables are instantiated (values are available for them), while no observed values are available for the hidden variables.

In the HMM (which is a specific kind of Bayesian network), the task is: given values for the observable (here, output) variables, estimate the most probable values of the state variables that led to the output.

One of the earliest major applications of the HMM was speech recognition, in the 1970s. Basically, the outputs are spoken syllables (observed) and the corresponding states (hidden) are the written (real) syllables. The aim is to find the most probable sequence of written syllables (sentences) corresponding to the spoken words. The associated algorithm is named after Andrew Viterbi, who introduced it in 1967 for decoding convolutional codes.
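The decoding task described above can be sketched with a small Viterbi implementation. This is a toy illustration, not a speech recognizer: the two hidden weather states, the three observable activities, and all the probabilities are made-up assumptions, and the function name `viterbi` and its arguments are hypothetical.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable hidden state path, its joint probability)."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # pick the previous state that makes reaching s most probable
            prev = max(states, key=lambda r: best[t - 1][r] * trans_p[r][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev
    # trace the best final state back to the start
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model: the weather is hidden, the activity is observed.
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}
observations = ["walk", "shop", "clean"]

path, prob = viterbi(observations, states, start_p, trans_p, emit_p)
print(path, prob)  # ['Sunny', 'Rainy', 'Rainy'], probability about 0.01344
```

The observed sequence `observations` has values, while the weather sequence is never sampled directly; `path` is only the most probable explanation of what was observed.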

Another answer

I will also add and update my own interpretations here as I get answers:

  • Response/output/dependent variable: all three terms are synonyms and refer to the variable $y$ that we aim to explain or predict with a model $f(x)$ of the independent variables $x$. An example of $y$ is the temperature.

  • Independent variable: a variable $x$ that serves as the input to our model $f(x)=y$. For example, $x$ could be time: given the time $x$, what is the temperature $y$ outside?

  • Hidden/latent variable: a variable (or set of variables) that we aim to infer from the observed input data $x$. In many contexts, the latent/hidden variable is an output variable that we cannot measure directly, or can measure only with great difficulty. For example, in K-means clustering the goal is to find out which cluster each observation $x$ belongs to. This cluster id is the latent variable $y$. We usually use some iterative method to infer the latent/hidden cluster values $y$, such as expectation-maximization for a mixture model, which maximizes a likelihood defined over $x$ and $y$; K-means itself behaves like a hard-assignment variant of it.

  • Confounding variable (quoted text): "A confounding variable is an outside influence that changes the effect of a dependent and independent variable. This extraneous influence is used to influence the outcome of an experimental design. Simply, a confounding variable is an extra variable entered into the equation that was not accounted for. Example is the correlation between murder rate and the sale of ice-cream. As the murder rate raises so does the sale of ice-cream. One suggestion for this could be that murderers cause people to buy ice-cream. This is highly unlikely. A second suggestion is that purchasing ice-cream causes people to commit murder, also highly unlikely. Then there is a third variable which includes a confounding variable. It is distinctly possible that the weather causes the correlation. While the weather is icy cold, fewer people are out interacting with others and less likely to purchase ice-cream. Conversely, when it is hot outside, there is more social interaction and more ice-cream being purchased. In this example, the weather is the variable that confounds the relationship between ice-cream sales and murder."
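The ice-cream example in the quote can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are synthetic, temperature is assumed to drive both quantities linearly, and the coefficients, noise levels, and helper names `pearson` and `residuals` are invented for illustration.

```python
import random
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def residuals(y, z):
    """What is left of y after removing a least-squares linear fit on z."""
    z_bar, y_bar = mean(z), mean(y)
    sxy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    sxx = sum((zi - z_bar) ** 2 for zi in z)
    slope = sxy / sxx
    return [yi - (y_bar + slope * (zi - z_bar)) for zi, yi in zip(z, y)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption: temperature (the confounder) drives both variables;
# neither variable has any direct effect on the other.
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
murders = [0.5 * t + rng.gauss(0, 3) for t in temperature]

# Correlation when the confounder is ignored
raw_r = pearson(ice_cream, murders)

# Correlation after removing the linear effect of temperature from both
adjusted_r = pearson(residuals(ice_cream, temperature),
                     residuals(murders, temperature))

print(f"ignoring temperature:      r = {raw_r:.2f}")
print(f"adjusting for temperature: r = {adjusted_r:.2f}")
```

In this synthetic setup the two variables look strongly related until temperature is accounted for, at which point the correlation should drop to roughly zero.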