How to interpret parameter estimates in factor prediction (in R)

I have a data set in a .csv file with three factor levels, $1$, $2$, $3$ (fifteen observations of each), and each observation has a corresponding score.

Here are some details.

The data is contained in a simple csv file: the first column is labelled Team, and the second column is labelled Score.

The first column consists of fifteen 1's, followed by fifteen 2's, followed by fifteen 3's.

The R code I used was

data.source <- "http.www.. "  # (the data set)

SportScores <- read.csv(file = data.source)

I set x to the Team column as a factor, so that printing x gives 1 1 1 ... 1 2 2 2 ... 2 3 3 3 ... 3 with Levels: 1 2 3

names(SportScores)

y <- SportScores$Score

Using lm, I get the following parameter estimates in R:

Intercept (35.800)

x2 (0.066)

x3 (12.40)

The t values are very large for the intercept and $x3$, but very small for $x2$, i.e. it indicates to me that we cannot reject the null in this case, but what is the null? The estimates are $\beta_{0}=35.8$, $\beta_{1}^{c}=0.06667$, $\beta_{2}^{c}=12.40$.

But how do I interpret this? I want to see any differences in scores between the 3 levels. What is the test being conducted? For example, $\beta_{1}^{c}=0.06667$ has a small t value, so the null hypothesis is not rejected, but what is the null hypothesis in this case? Moreover, from the code output itself, how can I find the individual standard errors associated with the estimated means?

Best answer

In R, the lm() function is very useful for performing a regression analysis on categorical variables. To understand how R manages the data with this command, recall that, if we have a factor with $m$ levels, each with $n$ observations, R starts from the basic equation of the classical random effects model

$$ Y_{ij} = \mu + U_i + \epsilon_{ij}$$

where $Y_{ij}$ is the value of the $j^{th}$ item of the $i^{th}$ factor level, $\mu$ is the average score for the whole population, $U_i$ is the level-specific random effect, and $\epsilon_{ij}$ is the individual-specific effect.

As an example, suppose that $m$ soccer teams are randomly chosen among all teams of the world, and that $n$ players are randomly chosen from each selected team. The performance scores for each player in a given year are collected. Applying the random effects model, $Y_{ij}$ is the score of the $j^{th}$ player of the $i^{th}$ team, $\mu$ is the average score for the entire population, $U_i$ is the random team-specific effect, and $\epsilon_{ij}$ is the player-specific effect. In this model, the term $U_i$ quantifies the difference between the average score of team $i$ and the overall average score observed in the entire population. It is called a "random" effect because each team has been randomly selected from a larger number of teams. In your case, these considerations can be directly applied to your three levels, each with $15$ observations.
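To make the roles of $\mu$ and $U_i$ concrete, here is a minimal sketch in R. The data are simulated stand-ins (the variable names `team` and `score` and all numbers are assumptions for illustration, not the data from the question). It estimates $\mu$ by the grand mean and each $U_i$ by the deviation of a team mean from it.

```r
# Simulated stand-in data: 3 teams with 15 scores each (made-up numbers).
set.seed(1)
team  <- factor(rep(1:3, each = 15))
score <- rnorm(45, mean = rep(c(36, 36, 48), each = 15), sd = 5)

grand_mean   <- mean(score)                 # estimate of mu
team_means   <- tapply(score, team, mean)   # one mean per team
team_effects <- team_means - grand_mean     # estimates of U_i

print(grand_mean)
print(team_effects)
```

Because every team has the same number of observations, the estimated effects sum to zero here; with unequal group sizes that would no longer hold for this simple estimate.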

However, you have to consider that the lm() function of R estimates the $U_i$ as fixed constants and typically uses a reparameterization, commonly called the "reference cell model". Here, one of the $U_i$ (usually the first) is set to zero and is used as a reference. In this approach, which we could write as

$$ Y_{ij} = \mu^* + U_i + \epsilon_{ij}$$

the mean of category $1$ is taken as the intercept $\mu^*$, and the term $U_i$ measures the difference between the average score of team $i$ and the average score observed in the reference category $1$. So, looking at the R output in your question, the intercept corresponds to $\mu^*$ (the mean of the first level), the coefficient $x_2$ is an estimate of the difference in means between level $2$ and level $1$, and similarly the coefficient $x_3$ is an estimate of the difference in means between level $3$ and level $1$. Note that the output does not include any $x_1$ coefficient, simply because the first level is the reference.

Each $t$ value tests the null hypothesis that the corresponding coefficient is zero; for $x_2$, for instance, the null hypothesis is that levels $2$ and $1$ have the same mean. In your output $x_2$ is rather small and $x_3$ is large; accordingly, the $t$ value is large for $x_3$ and small for $x_2$. This means that the difference between level $3$ and the reference level $1$ is probably highly significant, whereas that between level $2$ and the reference level $1$ is probably not significant (however, to correctly assess significance, you have to look at the p values, which are given in the R output). The high $t$ value of the intercept simply reflects the test of whether the mean of the first category differs from $0$, and therefore is not particularly useful.
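The reference cell interpretation can be checked numerically. The sketch below again uses simulated data (the names `team`, `score`, `fit` and `fit_means` and all numbers are assumptions, not the data from the question). It compares the lm() coefficients with the group means and prints the coefficient table, whose "Std. Error" and "Pr(>|t|)" columns hold the standard errors and p values. Refitting without an intercept is one way to obtain an estimate and standard error for each group mean directly.

```r
# Simulated stand-in data: 3 levels with 15 scores each (made-up numbers).
set.seed(1)
team  <- factor(rep(1:3, each = 15))
score <- rnorm(45, mean = rep(c(36, 36, 48), each = 15), sd = 5)

# Default treatment coding: level 1 is the reference.
fit <- lm(score ~ team)
group_means <- as.numeric(tapply(score, team, mean))

# Intercept = mean of level 1; team2, team3 = differences from level 1.
comparison <- data.frame(
  from_lm    = unname(coef(fit)),
  from_means = c(group_means[1],
                 group_means[2] - group_means[1],
                 group_means[3] - group_means[1]),
  row.names  = names(coef(fit))
)
print(comparison)

# Columns: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- summary(fit)$coefficients
print(coef_table)

# Dropping the intercept gives one coefficient per level, i.e. the
# group means themselves, each with its own standard error.
fit_means <- lm(score ~ team - 1)
print(summary(fit_means)$coefficients)
```

In the second table the t tests compare each group mean with zero, so that fit is convenient for reading off standard errors of the means but not for comparing levels with each other.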