I have data that map several nominal variables and one real parameter into a real value. For example:
('A', 'left', 'male', 'dog', 1.3459) -> 3.453
('A', 'top', 'male', 'dog', 6.3459) -> 6.137
...
('C', 'right', 'female', 'cat', 4.726) -> 1.456
I need to use these data to fit a function so that I can predict values for new input. For example:
('C', 'top', 'female', 'cat', 0.3459) -> ?
Fortunately, to a good approximation the output is a linear function of the real argument: y = c + k*x. However, our hypothesis is that we can improve the quality of the fit if we assume that the parameters of the linear fit ("c" and "k") depend on the nominal variables.
My question is: how can I find the dependency of the fit parameters on the nominal variables?
The first idea is to consider all possible combinations of the values of the nominal variables and perform a separate fit for each combination. Let us say we take all the data for ('A', 'left', 'male', 'dog') and perform the linear fit. Then we do the same for ('A', 'left', 'male', 'cat'). However, this approach will not work, since many combinations have no data points at all, or only a few.
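To make the per-combination idea concrete, here is a minimal sketch with made-up toy values; note how the combination with a single data point never gets a fit:

```python
from collections import defaultdict

import numpy as np

# Toy training rows (made-up values): (nominal variables, x, y)
data = [
    (('A', 'left', 'male', 'dog'), 1.0, 3.1),
    (('A', 'left', 'male', 'dog'), 2.0, 5.0),
    (('A', 'left', 'male', 'dog'), 3.0, 7.1),
    (('C', 'right', 'female', 'cat'), 4.726, 1.456),  # a lone data point
]

# Group the points by the full combination of nominal values
groups = defaultdict(list)
for nominal, x, y in data:
    groups[nominal].append((x, y))

# Fit y = c + k*x per combination -- only possible with >= 2 points
fits = {}
for nominal, points in groups.items():
    if len(points) < 2:
        continue  # combinations with too few points get no fit at all
    xs, ys = zip(*points)
    k, c = np.polyfit(xs, ys, 1)
    fits[nominal] = (c, k)
```

With four nominal variables the number of combinations multiplies quickly, so `fits` ends up covering only a small fraction of the input space.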
Alternatively, I can make independent fits for different values of one fixed nominal variable, ignoring the other variables. Then I can do the same for the second nominal variable, and so on. But then the question is how to combine these independent fits.
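The per-variable idea can be sketched the same way (made-up toy values; each fit pools over all the other nominal variables, and repeating this for the second, third, ... variable gives several independent fits that would then have to be combined somehow):

```python
from collections import defaultdict

import numpy as np

# Toy training rows (made-up values): (nominal variables, x, y)
data = [
    (('A', 'left', 'male', 'dog'), 1.0, 3.0),
    (('A', 'top', 'male', 'dog'), 2.0, 5.2),
    (('C', 'top', 'female', 'cat'), 1.5, 2.1),
    (('C', 'right', 'female', 'cat'), 3.0, 1.0),
]

# Fix one nominal variable (here the first one) and pool over the rest
by_first = defaultdict(list)
for nominal, x, y in data:
    by_first[nominal[0]].append((x, y))

# One independent linear fit per value of that variable
marginal_fits = {}
for value, points in by_first.items():
    xs, ys = zip(*points)
    k, c = np.polyfit(xs, ys, 1)
    marginal_fits[value] = (c, k)
```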
So, what would be your approach?
Here is one approach. First, for each training example calculate the expected output using just the real value and the learnt linear relation. Next, subtract the real output from this estimate to get your prediction error. You can now re-write your training set with just the nominal variables as inputs and the error resulting from the learnt linear approximation as outputs.
In other words, you want to learn the mapping from the nominal variables to the error of the linear prediction.
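The relabelling step is cheap to do in code. A minimal sketch, reusing the example values from the question as toy rows (the fitted line, and hence the errors, are whatever the single global fit happens to produce):

```python
import numpy as np

# Toy rows taken from the question's examples: (nominal variables, x, y)
data = [
    (('A', 'left', 'male', 'dog'), 1.3459, 3.453),
    (('A', 'top', 'male', 'dog'), 6.3459, 6.137),
    (('C', 'right', 'female', 'cat'), 4.726, 1.456),
]
x = np.array([row[1] for row in data])
y = np.array([row[2] for row in data])

# Step 1: a single global linear fit y ~ c + k*x, ignoring the nominals
k, c = np.polyfit(x, y, 1)

# Step 2: the prediction error of that global fit for each row
errors = (c + k * x) - y

# Step 3: the re-written training set: nominal variables -> error
error_data = [(row[0], e) for row, e in zip(data, errors)]
```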
My initial suggestion would be to learn a decision tree [1] to estimate just c (keeping your original k). Then, using these modified c values, learn a new decision tree to estimate the gradient k. You might want to repeat these two steps a few times, i.e., re-estimate the c values using your latest k values, and so on. I suspect (/hope!) this process would converge, and the final two decision trees (one for c and one for k) can then be used to predict c and k for a given set of nominal variables.
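Here is a sketch of that alternating scheme using scikit-learn's `DecisionTreeRegressor` with ordinal-encoded nominal variables. The synthetic data, the category names, and the `x**2` sample weighting (which makes each k-leaf a least-squares slope estimate rather than a plain mean, helping the alternation stay stable) are my own assumptions, not something from the question:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data where c and k truly depend only on the first nominal
# variable (the names and values here are hypothetical):
true_params = {'A': (1.0, 2.0), 'B': (3.0, -1.0)}  # category -> (c, k)
rows, xs, ys = [], [], []
for _ in range(200):
    a = rng.choice(['A', 'B'])
    b = rng.choice(['left', 'right'])   # an irrelevant nominal variable
    c_true, k_true = true_params[a]
    x = rng.uniform(0.5, 5.0)           # keep x away from 0 (we divide by it)
    rows.append([a, b])
    xs.append(x)
    ys.append(c_true + k_true * x + rng.normal(0.0, 0.01))

Z = OrdinalEncoder().fit_transform(rows)  # nominal values -> numeric codes
x = np.asarray(xs)
y = np.asarray(ys)

# Start from one global linear fit y = c + k*x over all the data
k0, c0 = np.polyfit(x, y, 1)
c_pred = np.full_like(y, c0)
k_pred = np.full_like(y, k0)

# Alternate: re-estimate c given the current k, then k given the new c.
for _ in range(30):
    tree_c = DecisionTreeRegressor(max_depth=3)
    tree_c.fit(Z, y - k_pred * x)         # leaf value = best intercept per leaf
    c_pred = tree_c.predict(Z)
    tree_k = DecisionTreeRegressor(max_depth=3)
    tree_k.fit(Z, (y - c_pred) / x, sample_weight=x ** 2)
    k_pred = tree_k.predict(Z)            # leaf value = least-squares slope per leaf
```

On synthetic data like this the alternation settles on per-category (c, k) values close to the true ones; whether it behaves as well on real data is exactly the convergence question flagged in the suggestion.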
I hope that helps - or at least gives you an approach to consider.