Optimizing expectation, unknown parameters, normal distribution. How many restaurants should I try before choosing one for the rest of my n-m meals?


Suppose you move to a new city with an infinite number of restaurants, and you plan to stay there for a predetermined amount of time, over which you will have $n$ meals at restaurants. Assume there are no reviews of any restaurant, so the only way to judge a restaurant's quality is to eat there and give it a rating. Also assume that restaurant quality follows a normal distribution, but you don't know the parameters $\mu$ and $\sigma$ of the distribution. Your goal is to maximize the expectation of your combined restaurant experiences. You do not value variety and would happily eat every single meal at the best restaurant if you could find it.

For example, if you simply ate at a different restaurant for every meal, your expected combined experience would be $n\mu$. Likewise, if you had a meal at a random restaurant and then ate all your remaining meals there without trying any others, your expected combined experience would again be $n\mu$.

But you could improve upon that by trying two restaurants, then choosing the better of the two and having all your remaining meals there. Then your expected combined experience would be $(n-1)a_1 + a_2$, where $a_1$ is the expected rating of the better of the two restaurants and $a_2$ is that of the worse one. (What would $a_1$ and $a_2$ be in terms of $\mu$ and $\sigma$ for this case? I know $(a_1+a_2)/2 = \mu$ but don't know how far apart they would be.) You could improve further by trying three restaurants and choosing the best of those, and so on. If you sampled $m$ restaurants before settling on one for your remaining meals, your combined expectation would be $(n-m+1)a_1 + a_2 + a_3 + \dots + a_m$, where $a_1$ is now the highest expectation among the $m$ draws.
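(For reference: for $m=2$ the gap is a standard normal order-statistic result, $a_1 = \mu + \sigma/\sqrt{\pi}$ and $a_2 = \mu - \sigma/\sqrt{\pi}$. A quick Monte Carlo sketch, with illustrative values $\mu=0$, $\sigma=1$:)

```python
import numpy as np

# Monte Carlo check of a1 and a2 for m = 2 draws from N(0, 1).
rng = np.random.default_rng(0)
draws = rng.normal(0.0, 1.0, size=(200_000, 2))
a1 = draws.max(axis=1).mean()   # expected rating of the better restaurant
a2 = draws.min(axis=1).mean()   # expected rating of the worse restaurant
print(a1, a2)                   # close to +/- 1/sqrt(pi) ~ +/- 0.5642
```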

Main question: How many restaurants, $m$, should you try before picking the best of them for your remaining $n-m$ meals?

There are 2 answers below

BEST ANSWER

Use dynamic programming.

Let $x_{t-1}$ be the quality of the best restaurant you have visited before period $t$. In each period you can either sample a new restaurant or stick with the best you already have.

In the final period, $t=n$, the value function is $$ J_n(x_{n-1}) = \max\{ \mu, x_{n-1} \}, $$ and you should sample a new restaurant if $\mu > x_{n-1}$; otherwise go to your existing favorite.

Backwards inducting, at any period $t<n$ you have the same kind of functional equation, $$ J_t(x_{t-1}) = \max \{\, \mu + \mathbb{E}[ J_{t+1}(x_t) \mid x_{t-1}],\ (n-t+1)\,x_{t-1} \,\}, $$ where $(n-t+1)\,x_{t-1}$ is the value of stopping, which is an absorbing state: you eat all $n-t+1$ remaining meals at your current favorite. So the optimal policy is to stop if $(n-t+1)\,x_{t-1} \ge \mu + \mathbb{E}[J_{t+1}(x_t)\mid x_{t-1}]$. Because a new draw only changes the state when it beats $x_{t-1}$, $$ \mathbb{E}[J_{t+1}(x_t)\mid x_{t-1}] = \int_{-\infty}^{x_{t-1}} J_{t+1}(x_{t-1})\,dF(z) + \int_{x_{t-1}}^\infty J_{t+1}(z)\,dF(z) = J_{t+1}(x_{t-1})F(x_{t-1}) + \int_{x_{t-1}}^\infty J_{t+1}(z)\,dF(z). $$ So the break-even $x_{t-1}$ satisfies $$ (n-t+1)\,x_{t-1} = \mu + J_{t+1}(x_{t-1})F(x_{t-1}) + \int_{x_{t-1}}^\infty J_{t+1}(z)\,dF(z). $$ Given $J_{t+1}$, you can solve that numerically for the optimal break-even $x_{t-1}$.

Given the value function $J_{t+1}$, you can compute the expectation numerically and do another round of backwards induction, all the way back to $t=1$.
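A minimal numerical sketch of this backward induction, assuming for simplicity that $\mu$ and $\sigma$ are known (i.e. ignoring the learning component discussed further down; the function name and grid sizes are illustrative), with Gauss-Hermite quadrature for the expectation:

```python
import numpy as np

def solve_cutoffs(n, mu=0.0, sigma=1.0, n_grid=401, n_quad=64):
    """Backward induction for the restaurant problem with KNOWN (mu, sigma).
    Returns [(t, x*_t)] for t = 1..n-1: stop sampling in period t as soon as
    the best quality seen so far is at least x*_t."""
    # Gauss-Hermite rule for E[f(Z)] with Z ~ N(mu, sigma^2)
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    z = mu + sigma * np.sqrt(2.0) * nodes
    w = weights / np.sqrt(np.pi)

    x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, n_grid)  # state grid
    J_next = np.maximum(mu, x)            # J_n(x) = max{mu, x}
    cutoffs = []
    for t in range(n - 1, 0, -1):         # t = n-1, ..., 1
        # E[J_{t+1}(max(z, x))] by quadrature, interpolating J on the grid
        cont = np.array([np.sum(w * np.interp(np.maximum(z, xi), x, J_next))
                         for xi in x])
        sample_val = mu + cont            # try a new restaurant, then continue
        stop_val = (n - t + 1) * x        # eat all remaining meals at the best
        hit = stop_val >= sample_val
        cutoffs.append((t, x[np.argmax(hit)] if hit.any() else np.inf))
        J_next = np.maximum(sample_val, stop_val)
    return cutoffs[::-1]
```

For $\mu=0$, $\sigma=1$ the last-period cutoff should be near $0.28$, the root of the break-even condition $2x = x\Phi(x) + \phi(x)$ for $t=n-1$, and cutoffs at earlier periods are higher.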

The key is that experimenting early is valuable, even if you already have a pretty good restaurant, even one substantially above $\mu$; but as time goes on, you will prefer to settle for a better-than-$\mu$ restaurant rather than risk a bad meal.

If you add discounting with factor $\beta$ and take $n \rightarrow \infty$, you get a simpler problem: the functional equation is $$ J(x) = \max \left\lbrace \dfrac{x}{1-\beta},\ \mu + \beta \int_{-\infty}^x J(x)\,dF(x') + \beta \int_{x}^\infty J(x')\,dF(x')\right\rbrace. $$ That equation has a unique solution $J^*$, and from it you can solve for the optimal cutoff $x^*$, which is independent of the calendar date: you search until you get a restaurant better than $x^*$, then stop.
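A sketch of value-function iteration for this stationary discounted problem, under the same known-parameters simplification as above (names, grid, and tolerances are illustrative):

```python
import numpy as np

def solve_infinite(beta=0.9, mu=0.0, sigma=1.0, n_grid=301, n_quad=32,
                   tol=1e-8, max_iter=5000):
    """Value-function iteration for the discounted infinite-horizon version
    with KNOWN (mu, sigma). Returns (grid, J, x*): sample new restaurants
    until a draw beats the reservation quality x*, then stay forever."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    z = mu + sigma * np.sqrt(2.0) * nodes
    w = weights / np.sqrt(np.pi)

    x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, n_grid)

    def expected_J(J):
        # E[J(max(z, x))] for each grid point x
        return np.array([np.sum(w * np.interp(np.maximum(z, xi), x, J))
                         for xi in x])

    J = np.maximum(x, mu) / (1.0 - beta)  # initial guess
    for _ in range(max_iter):
        J_new = np.maximum(x / (1.0 - beta), mu + beta * expected_J(J))
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new

    stay = x / (1.0 - beta) >= mu + beta * expected_J(J)
    x_star = x[np.argmax(stay)] if stay.any() else np.inf
    return x, J, x_star
```

For $\mu=0$, $\sigma=1$, $\beta=0.9$, the cutoff $x^*$ lands near $0.9$, the root of $x(1-\beta\Phi(x)) = \beta\,\phi(x)$ implied by the functional equation.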

Also look up Bellman equations and discrete time dynamic programming.

Sorry, I got excited about the solution and initially didn't understand what you meant by not knowing the parameters, but figured it out once I got to the end. It's easy to add that: put $\theta = (\mu,\sigma^2)$ into the state of the value function and use Bayes' rule to update your beliefs from period to period. There are many ways to approach this, including https://en.wikipedia.org/wiki/Recursive_Bayesian_estimation and https://en.wikipedia.org/wiki/Kalman_filter
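One standard way to do that updating, if one assumes a conjugate Normal-Inverse-Gamma prior over $\theta = (\mu, \sigma^2)$, is the following sketch (the parameterization and starting hyperparameters are illustrative):

```python
def nig_update(m, kappa, alpha, beta, y):
    """One conjugate Normal-Inverse-Gamma belief update after observing a
    single rating y ~ N(mu, sigma^2), with both parameters unknown.
    (m, kappa) summarize beliefs about mu; (alpha, beta) about sigma^2."""
    kappa_new = kappa + 1.0
    m_new = (kappa * m + y) / kappa_new            # shrink prior toward data
    alpha_new = alpha + 0.5
    beta_new = beta + kappa * (y - m) ** 2 / (2.0 * kappa_new)
    return m_new, kappa_new, alpha_new, beta_new
```

The belief state $(m, \kappa, \alpha, \beta)$ then rides along with $x_{t-1}$ in the value function, which is what makes the state space large enough that the continuous-state methods mentioned below are worthwhile.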

You'll probably want to use a collocation method to treat the state space as continuous instead of discretizing, and then use quadrature to compute the expectations.


This is more of a follow-up question than an answer: it makes intuitive sense that we will never want to go back to exploration after we exploit for a period. However, the answer above does not show this formally.

The optimality equation is, more generally, $$J_t(x_{t-1}) = \max \{\, \mu + \mathbb{E}[J_{t+1}(x_t) \mid x_{t-1}],\ \ x_{t-1} + J_{t+1}(x_{t-1}) \,\},$$ where the second branch means exploiting for one period without committing to stay forever.

How can it be formally proven that, after a period where we exploit, it will never be optimal to go back to exploration?

For more details on this issue, I posted a proper question here