You are given a data file called Calculus_data.tab (a huge file).
We take a random sample of 30 students of each sex from the data file.
The Question:
Estimate the difference between SATQ scores for male and female students. (At this point in your statistical education, you should interpret estimate to mean that you should find an interval estimate; for example, a confidence interval.) What conclusions can you draw from this estimation? What conclusions can you not draw, but some people might be tempted to draw?
Please don't answer the question; I just want tips. I'm only including the question for reference.
My professor gave me an R script that goes like this:
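# Start from a clean workspace
rm(list = ls())
library(tidyverse)
setwd("C:/Users/jj948/OneDrive/Desktop/R modules")
# Read the tab-delimited file and keep only rows where both SATQ and ACTMath are present
calc_data <- read.delim("Calculus_data.tab") %>%
  filter(!is.na(SATQ), !is.na(ACTMath))
# Split the data by sex
calc_data_female <- filter(calc_data, Sex == "W")
calc_data_male <- filter(calc_data, Sex == "M")
# Draw a simple random sample of n rows (without replacement) from each group
n <- 30
samples_female <- calc_data_female[sample(1:nrow(calc_data_female), n), ]
samples_male <- calc_data_male[sample(1:nrow(calc_data_male), n), ]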
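As a reference point for the kind of interval the question asks about, here is a minimal sketch of how two samples like these could feed into a confidence interval for the difference in mean SATQ. It is not the professor's method. It runs on simulated scores in a made-up data frame called `fake_data` (only the column names `Sex` and `SATQ` come from the real file), and it assumes a Welch two-sample t interval is an acceptable choice.

```r
# Simulated stand-in for Calculus_data.tab: the values are invented,
# only the column names Sex and SATQ match the real file.
set.seed(1)  # fix the seed so the same rows are drawn on every run
fake_data <- data.frame(
  Sex  = rep(c("M", "W"), each = 500),
  SATQ = round(rnorm(1000, mean = 650, sd = 70))
)

n <- 30
males   <- fake_data[fake_data$Sex == "M", ]
females <- fake_data[fake_data$Sex == "W", ]

# Same sampling idea as the script above: n rows per group, without replacement
samples_male   <- males[sample(nrow(males), n), ]
samples_female <- females[sample(nrow(females), n), ]

# Welch two-sample t interval (t.test's default, var.equal = FALSE)
# for mean(male SATQ) - mean(female SATQ)
ci <- t.test(samples_male$SATQ, samples_female$SATQ,
             conf.level = 0.95)$conf.int
print(ci)
```

Calling `set.seed()` before sampling is optional, but without it every run draws a different 30 rows and so produces a different interval.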
Here is the structure of samples_female (output of str()):
$ Semester : int 20169 20159 20169 20179 20169 20149 20149 20149 20159 20149 ...
$ Calc_semester: Factor w/ 2 levels "first","second": 1 2 1 1 1 2 1 1 1 1 ...
$ Grade : Factor w/ 15 levels "A","A-","B","B-",..: 14 6 14 6 1 5 1 13 6 5 ...
$ Grade_grouped: Factor w/ 6 levels "A","B","C","D",..: 6 3 6 3 1 2 1 5 3 2 ...
$ Grade_GPA : num 0 2 0 2 4 3.33 4 0 2 3.33 ...
$ SemSchName : Factor w/ 12 levels "ARCHITECTURE",..: 9 8 9 9 9 8 9 9 12 2 ...
$ FirstSem : int 20169 20159 20169 20179 20169 20139 20149 20146 20159 20149 ...
$ Race : Factor w/ 19 levels "American Indian or Alaska Native",..: 8 19 19 8 16 19 19 8 19 19 ...
$ Sex : Factor w/ 2 levels "M","W": 2 2 2 2 2 2 2 2 2 2 ...
$ ParIncome : Factor w/ 9 levels "149999","19999",..: 3 9 6 1 8 3 4 7 5 3 ...
$ FatherEd : Factor w/ 8 levels "0","1","2","3",..: 5 6 5 6 5 7 6 1 5 5 ...
$ MotherEd : Factor w/ 8 levels "0","1","2","3",..: 5 5 5 5 6 7 6 3 6 7 ...
$ SATQ : int 540 NA 570 NA NA 620 740 510 680 630 ...
$ SATV : int 580 NA 620 NA NA 570 580 610 720 610 ...
$ ACTMath : int NA NA NA 27 23 27 NA NA 27 28 ...
$ ACTEng : int 0 0 0 35 35 24 0 0 33 26
I was going to compare predictive models for males and females, but I was using too many factors. Is there any way I can choose better factors for these models? Out of the factor and integer variables shown above, which do you think I should focus on? Are there any I should omit?
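One possible way to narrow down the variables is to tabulate, for each column, how much of it is missing and how many distinct values it takes. This is sketched below on a small made-up data frame: the `toy` object and its values are hypothetical, and only the column names come from the output above. A column that is constant within a sample (like Sex here), or a factor with nearly as many levels as rows, is a natural candidate to leave out of a model.

```r
# Hypothetical miniature of a sampled data frame; all values are invented
toy <- data.frame(
  Sex     = factor(c("W", "W", "W", "W")),
  Race    = factor(c("a", "b", "c", "d")),
  SATQ    = c(540L, NA, 570L, 620L),
  ACTMath = c(NA, NA, NA, 27L)
)

# Per-column screen: share of missing values and number of distinct
# non-missing values
screen <- data.frame(
  variable     = names(toy),
  prop_missing = sapply(toy, function(x) mean(is.na(x))),
  n_distinct   = sapply(toy, function(x) length(unique(x[!is.na(x)]))),
  row.names    = NULL
)
print(screen)
```

The same two `sapply()` calls should work unchanged on samples_female or samples_male in place of `toy`.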