What model should I use for judging a dimension given only composed data with another?

I am attempting to upgrade a modeling system using a limited type of statistical information, but with the sample covering the entire system. The problem is how to use the additional information in the upgraded system to make the desired improvements.

Original system:

A certain racing company has facilities to maintain up to n racers and m cars, so at any given time at most min{m,n} racers can be racing. Before a race, racers select their cars. How they are matched is unknown. It could fall anywhere between a rigid pattern, such as each racer always selecting the same car if possible, and uniformly random selection, where each racer is equally likely to select any car. Neither extreme actually represents reality, though it may be useful to assume one of them to make the problem easier. The racing company rates its racers by tracking the time they take to complete different racetracks. The company makes more profit the better its drivers do; we can assume profit is linearly proportional to performance. New racers are always applying, and it may be safe to assume that their potential skill must be inferred from the distribution of the history of past newly hired racers. A new hire could, for example, be expected to perform at the mean or the median of that distribution if that makes the problem easier. Some cars in the lot are much better than others, which biases how the racers are judged.

Suppose it costs c to train a potential new hire, accounting for the risk that they quit.

We assume new hires will stay 5 years on average. If a new hire is expected to make y_new profit per year, then we expect him to make 5 * y_new total. If not replaced, we assume the current employee will stay for those 5 years, providing 5 * y_old total. If

5 * y_new - c - (5 * y_old) > 0

then the current racer is fired and replaced with the new hire. Cars are judged naively: once a season, only the best and worst cars are timed, those times are averaged, and decisions to replace cars are based on that average or on rumors about the quality of the cars.
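The replacement rule can be sketched in code. This is a minimal illustration, assuming the inequality compares the new hire's five-year profit, net of the training cost c, against the current racer's five-year profit y_old per year. The function name `should_replace` and the `years` parameter are made up for this example.

```python
def should_replace(y_new, y_old, c, years=5):
    """Return True when replacing the current racer is expected to pay off.

    y_new: expected yearly profit from the new hire
    y_old: expected yearly profit from the current racer
    c:     one-off training cost for the new hire
    years: assumed average tenure (5 in the question)
    """
    net_gain = years * y_new - c - years * y_old
    return net_gain > 0


# Hypothetical numbers, for illustration only.
print(should_replace(y_new=120, y_old=100, c=50))
print(should_replace(y_new=105, y_old=100, c=50))
```

Note that with a strict inequality, a new hire who only breaks even does not trigger a replacement.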

Upgraded system:

The racing company wants to make two improvements by accounting for more data kept by the races:

Let run identify a particular racer's stats for a given race.

Let time indicate how long it takes a racer to get to the finish line.

Data: Instead of retaining the data:

{unique_run_identity, track_identity, time, racer_identity} for each run, the company starts to retain which car was used:

{unique_run_identity, track_identity, time, racer_identity, car_identity}

Improvement 1: The company wants to judge the racers' performance more justly or accurately by accounting for which car they used.

Improvement 2: The company wants to judge the cars as well, not just the racers.

Improvement 2 is automatically solved when improvement 1 is solved, since both are determined the same way, just swapping racer_identity and car_identity in the final algorithm. However, improvement 1 will have to be approximated in some way. Skills may conflict across cars: for example, racer A may have gotten a better time with car 1 than racer B did with that car, while racer B did better with car 2. A future system might attempt to assign cars to racers, but it may be easier to simply assume the racers are linearly ranked and judge them as if they were assigned a random car. In a more sophisticated version, the model could base its judgements on predicting which cars racers are more likely to select. While those solutions may be better, it is assumed that if improvement 1 is made, then, even with long-term racers being used to their particular cars, new racers will be better selected for the company in the long run, because the car each racer used is accounted for.

There are 1 best solutions below

On BEST ANSWER

A fixed-effects model may be useful here. Let $t_{pcrn}$ be the completion time for a given person $p$ driving car $c$ on track $r$ for run $n$ (assuming there can be multiple runs for each triplet $pcr$).

We could model $t_{pcrn}$ using a three-factor, additive, fixed-effects model:

$$t_{pcrn} = \mu + \phi_p+\alpha_c + \beta_r + \epsilon_{pcrn}$$

Where:

  • $\mu$ is the "grand mean" of all times in the database (i.e., all $t_{pcrn}$)
  • $\phi_p$ is a single number "fixed" effect due to the driver being person $p$: positive numbers indicate they are slower than average, after accounting for effects from the track and car.
  • $\alpha_c$ is the fixed effect for the car: positive numbers mean the car is slower than average, after accounting for different drivers and tracks.
  • $\beta_r$ is the track effect: some tracks are slower than others, i.e., $\beta_r>0$.
  • $\epsilon_{pcrn}$ is the error term: a $0$-mean random variable that is assumed to be iid for each observation, regardless of person, car, or track. This is often a normal distribution with mean=0 and some standard deviation $\sigma$.

The key is picking a distribution for the error term $\epsilon_{pcrn}$. You can then derive maximum likelihood estimates for each parameter above, along with approximate standard errors, so that you account for the uncertainty in the estimates.

If we assume that $\epsilon$ is roughly bell-shaped, you can fit this with ordinary linear regression: regress the time on all the factors, where the independent variables are just "1/0" codings specifying which factor level is in play:

$$t_{pcrn} = \mu + \sum_{w=1}^{N_p} \delta_{wp}\phi_w +\sum_{x=1}^{N_c} \delta_{xc}\alpha_x+\sum_{y=1}^{N_r} \delta_{yr}\beta_y+\epsilon_{pcrn}$$

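Where $\delta_{sv}$ is the Kronecker delta (equal to $1$ when $s=v$ and $0$ otherwise), and $N_p$, $N_c$, $N_r$ are the numbers of people, cars, and tracks.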
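As a rough sketch of how the additive model could be fitted without a matrix library: the function below uses backfitting (cyclic least-squares updates of one factor at a time) instead of building the 1/0 design matrix explicitly. When the racers, cars, and tracks are sufficiently cross-linked in the data, this should reach the same least-squares fit as the dummy-variable regression. The function name, the `(racer, car, track, time)` tuple layout, and the sum-to-zero centering of each set of effects are assumptions made for this example, not part of the model statement above.

```python
from collections import defaultdict


def fit_additive_effects(runs, n_sweeps=500):
    """Fit t = mu + phi[racer] + alpha[car] + beta[track] by backfitting.

    runs: list of (racer, car, track, time) tuples (assumed layout).
    Returns (mu, phi, alpha, beta), with each set of effects centered so
    that it averages to zero over its levels (an identification choice).
    """
    mu = sum(run[3] for run in runs) / len(runs)
    # effects[k] maps each level of factor k (0=racer, 1=car, 2=track)
    # to its current estimated effect; unseen levels start at 0.0.
    effects = [defaultdict(float), defaultdict(float), defaultdict(float)]

    for _ in range(n_sweeps):
        for k in range(3):
            total = defaultdict(float)
            count = defaultdict(int)
            for run in runs:
                # Partial residual: remove mu and the other two factors.
                others = sum(effects[j][run[j]] for j in range(3) if j != k)
                total[run[k]] += run[3] - mu - others
                count[run[k]] += 1
            # Least-squares update for factor k given the other factors.
            for level in total:
                effects[k][level] = total[level] / count[level]

    # Move each factor's average effect into mu; fitted values are unchanged.
    for fx in effects:
        shift = sum(fx.values()) / len(fx)
        for level in fx:
            fx[level] -= shift
        mu += shift

    return mu, dict(effects[0]), dict(effects[1]), dict(effects[2])


# Hypothetical run log, for illustration only.
example_runs = [
    ("ann", "car1", "oval", 91.2),
    ("ann", "car2", "oval", 93.0),
    ("bob", "car1", "oval", 94.1),
    ("bob", "car2", "hill", 103.5),
    ("ann", "car1", "hill", 99.8),
]
mu, phi, alpha, beta = fit_additive_effects(example_runs)
print(sorted(phi, key=phi.get))  # racers ordered from fastest to slowest effect
```

With unbalanced data, the centered $\mu$ returned here is the average over factor levels rather than the raw mean of all recorded times. Ranking cars works the same way by sorting `alpha` instead of `phi`.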

Under the assumption that the errors are approximately normal, the above is equivalent to saying that:

$$t_{pcrn} \sim \mathcal{N}(\mu+\phi_p+\alpha_c+\beta_r,\sigma^2)$$

for some $\sigma$, which we can estimate as the standard deviation of the residuals (there are other ways too).
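One common way to estimate $\sigma$ from the residuals is sketched below. It assumes you already have fitted values from the additive model and divides the residual sum of squares by the residual degrees of freedom, counting $1 + (N_p-1) + (N_c-1) + (N_r-1)$ free parameters. The function name and argument layout are hypothetical.

```python
import math


def residual_sd(observed, fitted, n_racers, n_cars, n_tracks):
    """Estimate sigma from residuals of the three-factor additive model."""
    # Free parameters: grand mean plus (levels - 1) per factor.
    n_params = 1 + (n_racers - 1) + (n_cars - 1) + (n_tracks - 1)
    dof = len(observed) - n_params
    if dof <= 0:
        raise ValueError("not enough runs to estimate sigma")
    ssr = sum((t - f) ** 2 for t, f in zip(observed, fitted))
    return math.sqrt(ssr / dof)
```

Dividing by the number of runs instead of the degrees of freedom gives the maximum likelihood estimate, which is slightly smaller.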

Keep in mind that this is a rather simple model, and things only get more complex from here. For example, we did not assume any synergies between drivers and cars (reflecting familiarity with a car) or between drivers and tracks (reflecting familiarity with the track). However, interactions add complexity and may be overkill, or may have effects too small to matter.
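As a sketch of what such an extension could look like, a driver-car synergy can be added as one extra term. The symbol $\gamma_{pc}$ is introduced here for illustration and is not part of the model above:

$$t_{pcrn} = \mu + \phi_p + \alpha_c + \beta_r + \gamma_{pc} + \epsilon_{pcrn}$$

Here $\gamma_{pc}<0$ would mean person $p$ is faster in car $c$ than the additive terms alone predict. This adds up to $(N_p-1)(N_c-1)$ free parameters, so it is only worth estimating when each driver-car pair has several recorded runs.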