I am not very good at math, but I am decent with databases. I have a thorny problem: given a set of two values and a category, I want to create a model that predicts the category from a value set.
For example, I have a table that has height, weight, and gender. What I need to be able to do is take an input of height and weight, and try to predict the gender. There is always new data being fed into the system, so the mean and standard deviation can change.
I'm not sure how to approach this. Initially I figured that for each gender I would find the mean and standard deviation, then add the standard deviation to the mean for an upper bound and subtract it for a lower bound. I would end up with a range of heights for men, and one for women. If the given height fell into the men's range, then that would be +1 for guessing "male", and likewise for women.
But that won't work. First, how would I handle the overlap? Second, we also have a weight variable to work with.
I am unsure of how to wrangle both height and weight to make a prediction, and it's been years since I last took any sort of math class. I vaguely remember multiplying two independent probabilities to get the probability of an occurrence, but I'm not sure if that's what I'm looking for. I also vaguely remember the concept of regression as a way to determine how correlated data is, but again, is this what I want? Any help would be appreciated.
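The idea of multiplying independent probabilities corresponds to what is usually called a naive Bayes classifier: for each gender, treat height and weight as independent, model each with a normal distribution built from that gender's mean and standard deviation, and multiply the two densities by the share of rows in that gender. The sketch below is only an illustration under those assumptions. The rows are invented toy data, and the function names (mean_std, normal_pdf, summarize, predict) are hypothetical, not part of any library.

```python
import math

# Invented toy rows for illustration only: (height in inches, weight in pounds, gender).
rows = [
    (70, 180, "male"), (72, 190, "male"), (68, 170, "male"),
    (71, 200, "male"), (69, 175, "male"),
    (62, 120, "female"), (64, 130, "female"), (60, 110, "female"),
    (65, 140, "female"), (63, 125, "female"),
]

def mean_std(values):
    """Population mean and standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def summarize(rows):
    """Per-gender prior plus (mean, std) for height and weight."""
    stats = {}
    for gender in {r[2] for r in rows}:
        group = [r for r in rows if r[2] == gender]
        stats[gender] = {
            "prior": len(group) / len(rows),
            "height": mean_std([r[0] for r in group]),
            "weight": mean_std([r[1] for r in group]),
        }
    return stats

def predict(stats, height, weight):
    """Pick the gender whose prior * height density * weight density is largest."""
    scores = {
        gender: s["prior"]
        * normal_pdf(height, *s["height"])
        * normal_pdf(weight, *s["weight"])
        for gender, s in stats.items()
    }
    return max(scores, key=scores.get)

stats = summarize(rows)  # re-run whenever new rows arrive
print(predict(stats, 72, 195))
```

Because every input gets a score for each gender, overlapping ranges stop being a problem: the larger score wins. Re-running summarize on the current table picks up any change in the means and standard deviations.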
EDIT: Thanks to everyone for the responses. I settled on creating a slope for each category, so there is a slope for men and a slope for women. My rationale is that if an xy value set is above the slope for men, the predicted gender is male; if it's below the slope for women, the predicted gender is female; if it falls between, then I will go with whichever slope it is closest to. From a mathematical perspective, does this make sense? If it does, then how do I check values against the slope? For example, if my slope for women is 0.1138 and my slope for men is 0.1067, how would I go about determining where a weight (x) / height (y) input of 100/61 sits for each slope?
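On the last question, a slope by itself does not pin down a line; an intercept is needed as well. The sketch below assumes each category's line has the form height = slope * weight + intercept. No intercept is given here, so the default of 0.0 is a placeholder assumption, and vertical_offset is a hypothetical helper name.

```python
def vertical_offset(x, y, slope, intercept=0.0):
    """How far the point (x, y) sits above (+) or below (-) the line y = slope * x + intercept."""
    return y - (slope * x + intercept)

weight, height = 100, 61

# intercept is left at the 0.0 placeholder; substitute the fitted value for each line.
offset_women = vertical_offset(weight, height, slope=0.1138)
offset_men = vertical_offset(weight, height, slope=0.1067)

print(offset_women, offset_men)
```

A positive offset means the point is above the line, a negative one means it is below, and the smaller absolute offset marks the closer line (measured vertically). With the placeholder intercept of 0, the two lines give heights of about 11.38 and 10.67 at a weight of 100, so a height of 61 lands far above both. That suggests the fitted lines have sizeable intercepts that need to be included before the comparison means anything.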
Since you are trying to predict a categorical variable, you can employ logistic regression, which models the probability of a category as a function of the inputs. See: https://en.wikipedia.org/wiki/Logistic_regression
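As a rough illustration of what that looks like in practice, here is a minimal sketch of logistic regression fitted by plain gradient descent, using only the standard library. The rows are invented toy data (1 = male, 0 = female); the function names, learning rate, and step count are arbitrary assumptions. A statistics package would normally do the fitting for you.

```python
import math

# Invented toy rows for illustration only: (height in inches, weight in pounds, label).
rows = [
    (70, 180, 1), (72, 190, 1), (68, 170, 1), (71, 200, 1), (69, 175, 1),
    (62, 120, 0), (64, 130, 0), (60, 110, 0), (65, 140, 0), (63, 125, 0),
]

def mean_std(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

# Standardize the inputs so gradient descent behaves well on both columns.
h_mean, h_std = mean_std([r[0] for r in rows])
w_mean, w_std = mean_std([r[1] for r in rows])

def scale(height, weight):
    return (height - h_mean) / h_std, (weight - w_mean) / w_std

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(rows, learning_rate=0.1, steps=2000):
    """Return (b0, b1, b2) for P(male) = sigmoid(b0 + b1*height + b2*weight) on scaled inputs."""
    b0 = b1 = b2 = 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = g2 = 0.0
        for height, weight, label in rows:
            h, w = scale(height, weight)
            error = sigmoid(b0 + b1 * h + b2 * w) - label
            g0 += error
            g1 += error * h
            g2 += error * w
        b0 -= learning_rate * g0 / n
        b1 -= learning_rate * g1 / n
        b2 -= learning_rate * g2 / n
    return b0, b1, b2

coefficients = fit(rows)

def prob_male(height, weight):
    """Estimated probability that the given height and weight belong to a male."""
    b0, b1, b2 = coefficients
    h, w = scale(height, weight)
    return sigmoid(b0 + b1 * h + b2 * w)

print(prob_male(72, 195), prob_male(61, 115))
```

A probability above 0.5 would be read as "male" and below 0.5 as "female". Refitting on the current table accounts for newly added rows.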