I need to train several logistic regression models, each on a different dataset (with a different set of labels):
Train data of model 1:
[X1, X2, X3] -> y1
[X4, X5, X6] -> y2
...
Train data of model N:
[X99, X101, X102] -> y8
[X103, X104] -> y9
As a model, I choose logistic regression with a One-vs-Rest training scheme. When I get an input from the user, I need to determine which label is most probable across all N models. So what I am trying to do is query each trained model with validation data and calculate the mean MU and standard deviation STD of its scores. Based on these values I will normalize predictions so they are comparable with each other, by subtracting the mean and dividing by the standard deviation. Is this assumption correct?
```python
# normalization
# NOTE: relies on self.gaussians preserving the same model order as `prediction`
normalized_predictions = [(pr - g["pos_mu"]) / g["pos_std"]
                          for g, pr in zip(self.gaussians.values(), prediction)]
# example output of each Logistic regression model
output.append({
    'answer': self.idx2label[idx],
    'confidence': normalized_predictions[idx],
    'hit': ""
})
...
# query and normalize - normalization is shown above
responses = query_log_reg_model_and_normalize()
# sort based on confidence value
responses = sorted(responses, key=lambda r: r['confidence'], reverse=True)
# return responses sorted by confidence; responses[0] is the most probable
return responses
```
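For reference, here is a minimal end-to-end sketch of the idea: fit one Gaussian (mean, std) per model from its validation scores, then z-score raw probabilities at query time so models with different score distributions become comparable. The model names, the `validation_scores` data, and the `normalize` helper are all hypothetical stand-ins, not the actual classes from the snippet above:

```python
from statistics import mean, stdev

# Hypothetical validation scores per model: the raw positive-class
# probabilities each trained model produced on its own validation set.
validation_scores = {
    "model_1": [0.62, 0.71, 0.55, 0.80, 0.66],
    "model_2": [0.90, 0.93, 0.88, 0.95, 0.91],
}

# Fit one Gaussian (mu, std) per model from its validation scores.
gaussians = {
    name: {"pos_mu": mean(scores), "pos_std": stdev(scores)}
    for name, scores in validation_scores.items()
}

def normalize(model_name, raw_probability):
    """Z-score a raw probability so scores from different models are comparable."""
    g = gaussians[model_name]
    return (raw_probability - g["pos_mu"]) / g["pos_std"]

# Query time: each model returns a raw probability for its best label.
raw = {"model_1": 0.75, "model_2": 0.92}
responses = sorted(
    ({"model": m, "confidence": normalize(m, p)} for m, p in raw.items()),
    key=lambda r: r["confidence"],
    reverse=True,
)
best = responses[0]  # most probable answer across all models
```

With these made-up numbers, `model_1` wins even though its raw probability (0.75) is lower than `model_2`'s (0.92), because 0.75 is much further above `model_1`'s own validation mean than 0.92 is above `model_2`'s: that is exactly the comparability the z-scoring is meant to buy.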