Making sense of GLM's problem setting


I'm totally new to generalized linear models (GLMs) and learning on my own, and I'm having trouble understanding the problem setting, sad, I know. This is what I figured out while trying to read the book "Foundations of Linear and Generalized Linear Models" by Alan Agresti. For simplicity I will assume nothing about the distribution of $Y$ and I will consider the identity link function.

We have a random variable $Y$ (response variable) and $p$ random variables $X_1,\dotsc,X_p$ (explanatory variables), so let's define $\mathbf{X}:=(X_1,\dotsc,X_p)$. We state that there exists a vector $\beta\in\mathbb{R}^p$ such that $E(Y|\mathbf{X}=\mathbf{x})=\mathbf{x}^\top\beta$.

Well, this seems like a fairly reasonable idea: we are trying to find a functional relation between $Y$ and $\mathbf{X}$, that is, $Y=h(\mathbf{X})$ for some measurable function $h$. But, sadly, $h\circ\mathbf{X}$ is $\sigma(\mathbf{X})$-measurable and, probably, $Y\not\in\sigma(\mathbf{X})$. However, we know that $E(Y|\mathbf{X})$ is the closest (in the $L^2$ sense) $\sigma(\mathbf{X})$-measurable function to $Y$. So we are just assuming that a linear function of $\mathbf{X}$ is the best we can get.
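As a quick numerical check of that $L^2$ claim, here is a sketch using simulated data with a hypothetical true model $Y = 2X + \varepsilon$ (the coefficient and the competing functions are made up for illustration); the conditional mean $E(Y|X) = 2X$ should sit closer to $Y$ in mean square than other functions of $X$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)  # true E(Y | X = x) = 2x, noise variance 1

# Mean squared L^2 distance from Y to three sigma(X)-measurable candidates
mse_cond = np.mean((y - 2.0 * x) ** 2)               # the conditional mean itself
mse_other = np.mean((y - (2.0 * x + 0.5 * x**2)) ** 2)  # a perturbed function of X
mse_zero = np.mean((y - 0.0) ** 2)                   # the constant zero function
```

With this setup `mse_cond` estimates the noise variance (about 1), while the other two candidates pay an extra squared-bias term, so they come out strictly larger.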

Our assumption is cool, but, back in the real world, we don't have a clue about what $\beta$ is, so we can take two different (equivalent?) approaches.

  1. Take $n$ independent observations of the random vector $(Y,\mathbf{X})$.
  2. Pre-select certain values for $\mathbf{X}$, observe $Y$, and repeat this process $n$ times independently.

In any case, as a result we have a vector $\mathbf{y}\in\mathbb{R}^n$, which corresponds to the $n$ independent observations of $Y$, and a matrix $\mathtt{X}\in\mathfrak{M}_{n\times p}(\mathbb{R})$, which corresponds to the $n$ independent observations (or the $n$ selections) of $\mathbf{X}$. Now we want to estimate $\beta$ from $\mathbf{y}$ and $\mathtt{X}$. There are plenty of methods and heuristics to do this (and plenty of statistical theory to evaluate the "goodness" of the estimation); to mention just one of them, we may take the least squares solution $\hat{\beta}$, i.e. the minimizer of $\lVert\mathbf{y}-\mathtt{X}\beta\rVert^2$ over $\beta$, and use it as our estimate.
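A minimal sketch of that estimation step, assuming simulated data and `numpy` (the true $\beta$ and the noise level here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3
X = rng.normal(size=(n, p))           # design matrix: n observations of p covariates
beta = np.array([1.0, -2.0, 0.5])     # hypothetical true coefficients
y = X @ beta + rng.normal(size=n)     # responses with additive noise

# Least squares: find beta_hat minimizing ||y - X b||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With a well-conditioned design and moderate noise, `beta_hat` lands close to the true coefficients, and it gets closer as $n$ grows.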

Probably, everything you have read above is totally wrong, but please, help me, did I get something right?

On BEST ANSWER

I'll try to give a useful answer. I believe that your main question (i.e., method 1 vs. method 2 for data collection) is the difference between a controlled experiment and observational data. Namely, in a controlled experiment you know what your explanatory variables $X$ are, because you choose them a priori and sometimes even manipulate them. In such a case, for $n$ independent "subjects", you observe $n$ realizations of $(Y, X)$, or of $Y|X=x$.

However, for observational data, you basically don't know what your explanatory variables $X$ are. So, although you still observe $(Y, X)$, $Y$ and $X$ may be independent.

Practically, assume that you are interested in the association of height $X$ and weight $Y$. So you observe $n$ realizations of $(Y, X)$; you can even switch the roles of the two. This is the first case. For the second case, you are interested in finding out what determines weight $Y$. Now you collect everything you can from every "subject", so your $X$ has many variables, some relevant, others probably not. You still observe $n$ independent realizations of $(X,Y)$, since you cannot randomize your $X$s and $Y$: every subject $i$ "brings" $(y_i, x_{1i}, \dotsc, x_{pi})$. If you collect the $X$s randomly, and then the weights $Y$ separately, not from the same individuals, your model will have no meaning (whatever estimation method you use).
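That last point can be sketched with a small simulation (heights, weights, and the slope 0.9 are all invented numbers): fitting on properly paired data recovers the association, while breaking the subject-level pairing destroys it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(loc=170, scale=10, size=n)         # heights (cm), one per subject
y = 0.9 * (x - 170) + 70 + rng.normal(size=n)     # weights (kg) of the SAME subjects

# Paired data: the fitted slope recovers the true association
slope_paired = np.polyfit(x, y, 1)[0]

# Broken pairing: heights from some people, weights from others
y_shuffled = rng.permutation(y)
slope_broken = np.polyfit(x, y_shuffled, 1)[0]
```

Here `slope_paired` comes out near 0.9, while `slope_broken` hovers near zero: once the rows of $(y_i, x_i)$ no longer come from the same subjects, there is no association left to estimate.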