I am currently building a multiple linear regression model to predict water consumption.
$Y$ is monthly household water consumption. I have demographic data for each household to use as my predictor variables $X_i$. They are:
- individual age,
- gender,
- race,
- marital status,
- building age,
- No. of rooms.
I have split the total number of people in each household into three categories:
- Number of adults,
- Number of children,
- Number of elderly.
Then I built the model:

$$ \begin{split} \text{consumption of household}& \sim \text{Number of adults}\\& + \text{Number of children}\\& + \text{Number of elderly}\\& + \text{Number of females}\\& + \text{Number of unmarried}\\& + \text{building age}\\& + \text{No. of rooms} \end{split} $$

(coefficients omitted).

My confusion is that Number of adults + Number of elderly + Number of children = Total number of people. Should I also include Number of females and Number of unmarried in the model? If I leave them out, the model will lose that information, right?

Please help me clarify these doubts. Many thanks in advance!
Including variables that are exactly linearly dependent leads to a model that does not have a unique solution, so it should usually be avoided. In your example, if you already have the numbers of children, adults and elderly, you don't also want a variable for the total number of people: it is exactly the sum of the other three, so it would not give the model any new information.
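To see why the solution is not unique, here is a short sketch. The symbols are my own shorthand, not from the question: $A$, $C$, $E$ are the numbers of adults, children and elderly, $T = A + C + E$ is the total, and $c$ is any constant. Because $cA + cC + cE - cT = 0$, shifting the coefficients by $c$ leaves every fitted value unchanged:

$$ \begin{split} \beta_A A + \beta_C C + \beta_E E + \beta_T T &= (\beta_A + c)A + (\beta_C + c)C \\ &\quad + (\beta_E + c)E + (\beta_T - c)T \end{split} $$

So infinitely many coefficient vectors fit the data equally well, and least squares cannot pick one of them.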
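The number of females and the number of unmarried people, however, each count only some subset of the total number of people. Neither is an exact linear combination of the three age-group counts, so including them does provide new information to your model.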
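One way to check this on your own data is to compare the rank of the design matrix with its number of columns. The snippet below is only a minimal sketch on simulated data: the household counts, the sample size `n`, and the 0.5 / 0.4 proportions are all made-up assumptions for illustration, not values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 200                         # hypothetical number of households

# Simulated age-group counts per household (illustrative values only)
adults = rng.integers(1, 4, size=n)
children = rng.integers(0, 4, size=n)
elderly = rng.integers(0, 3, size=n)
total = adults + children + elderly  # exact sum of the three counts

# Counts of subsets of each household; not determined by the age groups
females = rng.binomial(total, 0.5)
unmarried = rng.binomial(total, 0.4)

intercept = np.ones(n)
X_base = np.column_stack([intercept, adults, children, elderly])
X_with_total = np.column_stack([X_base, total])
X_with_subsets = np.column_stack([X_base, females, unmarried])

rank_base = np.linalg.matrix_rank(X_base)
rank_with_total = np.linalg.matrix_rank(X_with_total)
rank_with_subsets = np.linalg.matrix_rank(X_with_subsets)

for name, X, rank in [
    ("base", X_base, rank_base),
    ("base + total", X_with_total, rank_with_total),
    ("base + females + unmarried", X_with_subsets, rank_with_subsets),
]:
    print(f"{name}: {X.shape[1]} columns, rank {rank}")
```

If the rank is smaller than the number of columns, as it should be for the matrix that includes the total, the columns are linearly dependent and the coefficients are not uniquely determined. Adding the female and unmarried counts should instead raise the rank by one for each new column.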