Math Algorithm Needed with Aggregated Percentages (Business Scenario)


I'm trying to work out a math algorithm, but I'm having a hard time wrapping my head around the best approach. Instead of defining a complex industry-specific process, I'll construct a simplified scenario here (the math is what's important).

Imagine you're trying to identify the percent chance that a specific make of car is parked in a store's parking lot, based on the items sold within the store. To begin, you take a physical survey of $100,000$ store parking lots, recording each unique car make spotted outside, each unique item sold within the store, and a fixed percent relevance that each item has to the store (e.g., lumber has an $89\%$ relevance to Home Depot, but pencils only have a $23\%$ relevance to Walmart).

There are two parts to what I’m trying to solve. First, I’m trying to figure out the best way to roll this data up to a specific item, while respecting each relevance percent and the number of confirmed observations (so one spotting doesn’t equal a $100\%$ chance, similar to this). The goal is that if a brand new, never-before-seen store is selling Waterford glasses and cashmere sweaters, we can predict from those items that there’s, say, an $89\%$ chance a Mercedes is in the parking lot.

So to recap: each item has been seen a specific number of times across stores. For each of those sightings, there is a different product/store relevance percentage and a list of all car makes in the parking lot. How do I best calculate the percent chance that a specific make is in the parking lot of a brand new store, based only on the items within?
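To make the first part concrete, here is a minimal sketch of one way it could be framed, not a definitive method: treat each sighting as a fractional observation weighted by its relevance, then shrink the result toward a global base rate so that a single spotting cannot produce $100\%$. The names `Observation`, `item_estimate`, `store_estimate`, `base_rate`, and `prior_strength` are all hypothetical, and combining items by a relevance-weighted average is an assumption rather than the only reasonable choice.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    relevance: float  # item-to-store relevance in [0, 1] (assumed scale)
    make_seen: bool   # was the car make spotted in that store's lot?


def item_estimate(observations, base_rate, prior_strength=5.0):
    """Relevance-weighted estimate of P(make in lot | item), shrunk toward base_rate.

    prior_strength is a hypothetical tuning knob: the number of
    pseudo-observations the base rate is worth.
    """
    weighted_hits = sum(o.relevance for o in observations if o.make_seen)
    weighted_total = sum(o.relevance for o in observations)
    return (weighted_hits + prior_strength * base_rate) / (weighted_total + prior_strength)


def store_estimate(items, base_rate):
    """Combine per-item estimates for a new store.

    items is a list of (item_probability, relevance_in_new_store) pairs.
    With no usable items, fall back to the base rate.
    """
    total_weight = sum(relevance for _, relevance in items)
    if total_weight == 0:
        return base_rate
    return sum(p * relevance for p, relevance in items) / total_weight


# One high-relevance sighting only nudges the estimate above the base rate.
single = item_estimate([Observation(1.0, True)], base_rate=0.2)

# Many sightings let the data dominate the prior.
many = item_estimate(
    [Observation(0.9, True)] * 89 + [Observation(0.9, False)] * 11,
    base_rate=0.2,
)
```

With few sightings the estimate stays near `base_rate`; as the relevance-weighted count grows, it approaches the observed frequency.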

The second part is more complicated because it adds another layer of abstraction. If a single person visits $50$ stores, and we aggregate all the items in all those stores, we can predict what type of car they drive (e.g., lots of camping and hiking stores, so they have a $67\%$ chance of driving a Jeep). Then, if they visit a new store and are exposed to a brand new item for which we have no data, I need to apply that $67\%$ Jeep probability to the new item (still respecting the relevance of that item to the store). I then need to use that item’s less-than-certain Jeep statistic to influence our predictions for parking lots of stores that contain that new item (which was never directly measured). Perhaps this requires a confidence interval of some kind? Or how else can we represent that uncertainty, without every one of the millions of items we analyze eventually averaging out to $50\%$?
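One possible way to represent that uncertainty, sketched here under stated assumptions: keep a Beta distribution per item and make, so the estimate (the mean) and the amount of evidence behind it (the spread) are tracked separately. An inferred $67\%$ Jeep signal is added as a fractional pseudo-observation, scaled by the item's relevance and by a discount for being indirect. The class `BetaBelief`, the `discount` value, and the prior settings are all hypothetical. Centering the prior on the population base rate, rather than on $50\%$, is what keeps poorly observed items from drifting toward $50\%$.

```python
import math
from dataclasses import dataclass


@dataclass
class BetaBelief:
    alpha: float  # pseudo-count of "make present"
    beta: float   # pseudo-count of "make absent"

    @classmethod
    def from_prior(cls, base_rate, strength=2.0):
        # Start at the population base rate with weak evidence (assumed strength).
        return cls(base_rate * strength, (1.0 - base_rate) * strength)

    def update(self, probability, weight):
        """Add `weight` pseudo-observations, split by `probability`.

        A direct sighting uses probability 1.0 or 0.0; an inferred signal
        (such as a visitor's predicted make) uses a value in between.
        """
        self.alpha += weight * probability
        self.beta += weight * (1.0 - probability)

    @property
    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    @property
    def std(self):
        total = self.alpha + self.beta
        return math.sqrt(self.alpha * self.beta / (total ** 2 * (total + 1.0)))


# New, never-measured item: start from an assumed 10% Jeep base rate.
new_item = BetaBelief.from_prior(base_rate=0.10)

relevance = 0.8  # item-to-store relevance (illustrative value)
discount = 0.5   # hypothetical penalty for indirect, inferred evidence
new_item.update(probability=0.67, weight=relevance * discount)

print(f"mean={new_item.mean:.3f}, std={new_item.std:.3f}")
```

A single inferred signal moves the mean only part of the way from the base rate toward $67\%$, and the standard deviation stays wide. Downstream parking-lot predictions could weight each item by its total evidence `alpha + beta`, so weakly supported items contribute less.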

I really appreciate your help on this!