Searching Algorithm


A company database has 10,000 customers sorted by last name, 20% of whom are known to be good customers. Good customers account for 60% of record lookups.

Two design options are considered to store the data in the database:

  1. Put all the names in a single array and use binary search.
  2. Put the good customers in one array and the rest of them in a second array. Only if we do not find the customer on a binary search of the first array do we do a binary search of the second array.

Given the above two options, answer the following: i. Calculate the expected worst-case performance for each of the two structures above, given typical usage. Which of the two structures is the best option?

For this, I believe the worst case for binary search would be $\lceil\log_2(10000+1)\rceil$, which rounds up to 14. For the second scenario, it would be $\lceil\log_2(10000\cdot 0.2+1)\rceil$, which is 11 if the customer is good, and $\lceil\log_2(10000\cdot 0.8+1)\rceil$, which is 13 if the customer is not good, hence this would be the better scenario. Am I correct on this?
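As a quick sanity check on the individual figures (a minimal sketch, assuming a worst case of $\lceil\log_2(n+1)\rceil$ comparisons for binary search over $n$ sorted items):

```python
import math

def worst_case_binary(n):
    # Worst-case comparisons for binary search over n sorted items
    return math.ceil(math.log2(n + 1))

print(worst_case_binary(10_000))  # full array -> 14
print(worst_case_binary(2_000))   # good customers (20%) -> 11
print(worst_case_binary(8_000))   # remaining customers (80%) -> 13
```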

ii. Suppose that over time the usage of the database changes, and so a greater and greater fraction of lookups are for good customers. At what point does the answer to part i change?

If part i) was correct, then there is no value of $x$ for which $\log_2(10000x+1)$ is greater than 15?

iii. Under typical usage again, suppose that instead of binary search we had used linear search. What is the expected worst-case performance of each of the two structures and which is the better option? Where is the cross-over this time?

On BEST ANSWER

You have two terms next to each other there that perhaps need clarification: "expected" and "worst-case". If I assume that "worst-case" means the binary searches don't get lucky and "expected" means that we average over a typical mix of customer lookups, then the expected worst-case cost of the split structure in (i) is $0.6\cdot 11 + 0.4 \cdot(11+13) = 16.2 > 14$ (a lookup of a non-good customer pays for the failed search of the first array before searching the second), so the separate lists are not worthwhile.
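A sketch of that expectation, assuming the 60/40 lookup mix and the worst-case binary search counts from part (i):

```python
good, bad = 11, 13   # worst-case binary search costs for the two arrays
p_good = 0.6         # fraction of lookups that are for good customers

single = 14                                          # one combined array
# A non-good lookup first exhausts the good array (11), then searches the bad one (13)
split = p_good * good + (1 - p_good) * (good + bad)
print(split)  # 16.2, which exceeds 14, so the single array wins
```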

For (ii), in order to get the crossover point, define $k$ as the proportion of lookups that are for good customers and set $14 = 11k + 24(1-k) = 24-13k$, giving $k = 10/13 \approx 0.769$. So if 77% (or more) of lookups are for the good customers (who are still only 20% of the total customer base), separating the list for search purposes will be more efficient.
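Solving that linear equation for the crossover fraction directly:

```python
# Solve 14 = 11k + 24(1 - k) for k, the fraction of lookups
# that are for good customers
k = (24 - 14) / (24 - 11)
print(k)  # 10/13, roughly 0.769
```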

For (iii), I haven't done any calculation but I strongly expect that looking at the good customers first will always be preferable.


Adding the calculation for (iii): "worst-case" in the same sense as the binary search doesn't really work here, so use expected search length instead. For the full list it is $10000/2 = 5000$, and for the split list it is $0.6\cdot 1000 + 0.4 \cdot (2000+4000) = 3000$, so the split list is easily better. Using the same approach as above to find the crossover, $5000 = k\cdot 1000 + (1-k) \cdot (2000+4000) = 6000-5000k$, that is, $k=0.2$ - the proportion of good customers, as you would expect (if they're not buying more than anyone else, they're not really "good" any more).
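The linear-search numbers can be sketched the same way, assuming an expected cost of roughly $n/2$ comparisons for a successful scan of an array of $n$ items:

```python
# Expected linear-search cost: ~n/2 comparisons for a hit in an array of n
full = 10_000 / 2                                      # single array: 5000
# Good lookup: ~1000; non-good: scan all 2000 good, then ~4000 in the bad array
split = 0.6 * (2_000 / 2) + 0.4 * (2_000 + 8_000 / 2)  # split arrays: 3000
# Crossover fraction k of good-customer lookups:
# 5000 = 1000k + 6000(1 - k)
k = (6_000 - 5_000) / (6_000 - 1_000)
print(full, split, k)  # 5000.0 3000.0 0.2
```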