Optimize database indexes, the sequel

95 Views Asked by At

Background

For those not familiar with databases. Indices help to speed up database searches, but they come at the cost of memory. Since you like your database to be fast, they are best cached in RAM memory (not 4 TB slow HDD's).

But you can not have an infinite amount of indexes because they take up quite a bit of memory.

The tradeoff is the amount of memory you can use versus the most database access commands (queries) having fast speed.

Question

I have $X$ columns, how can I pick $Y$ indices to maximize my coverage. A history of past queries and their occurrence is known.

Rules

  • A query consists of 1 or more columns.
  • An index consists of 1 or more columns.
  • Query order is not important
  • Index order is important
  • Index can be used for a query if the first columns of an index match those of the query

Example

Suppose our database consisted of 6 columns, and we have 5 queries, amount of wanted indexes: see below

Columns: 1, 2, 3, 4, 5, 6 (column 6 is never queried in this example, so it doesn't count)
Query A: 1, 2, 3, 4 -- occurs: 3 times
Query B: 2, 3, 4, 5 -- occurs: 68 times
Query C: 2, 3 -- occurs: 12 times
Query D: 3, 4 -- occurs: 45 times
Query E: 2, 3, 1 -- occurs: 6 times
Total amount of queries: 134

Query E.

Can A be used for E? Yes, when ordered like: 2, 3, 1 ... trivial from here, example: 2, 3, 1, 4

Can B be used for E? No (1 does not appear in B)

Query C.

Can A be used for C? Yes, when ordered like: 2, 3, ... trivial from here, example: 2, 3, 1, 4

Can B be used for C? Yes, when ordered like: 2, 3, ... trivial form here, example: 2, 3, 4, 5 (as is)

Can E be used for C? Yes, when ordered like: 2, 3, ... trivial from here, example: 2, 3, 1

Query D.

Can A be used for D? Yes, when ordered like: 2, 3, ... trivial from here, example: 3, 4, 1, 2

Can B be used for D? Yes, when ordered like: 2, 3, ... trivial form here, example: 3, 4, 1, 5

Can E be used for D? No (4 does not appear in E)

Possible combinations to cover everything:

  • We have two indices
    • 2, 3, 1, 4 (covers A, E and C).
    • 3, 4, 1, 5 (covers D and B)
    • Efficiency: 2 indices for 5 queries has its efficiency be $$\left(1 - \frac{2}{5}\right) 100\% = 60\%$$
  • We have three indices
    • 3, 4, 1, 2 (covers A and D)
    • 2, 3, 4, 5 (covers B and C)
    • 2, 3, 1 (covers E)
    • Efficiency: 3 indexes for 5 queries has its efficiency be $$\left(1 -\frac{3}{5}\right) 100\% = 40\%$$
  • Four indices
    • 1, 2, 3, 4 (covers A)
    • 3, 4 (covers D)
    • 2, 3, 4, 5 (covers B and C)
    • 2, 3, 1 (covers E)
    • Efficiency: 4 indexes for 5 queries $$\left(1 -\frac{4}{5}\right) 100\% = 20\%$$
  • ...and more combinations...

Now that we have all combinations and sorted by efficiency. We have a requirement that we can only use one index!

With an even occurrence among all queries the winner would be: 2, 3, 1, 4 because it covers 3 queries (A, E and C). It just so happens to be that queries A, E and C only occur 3, 6 and 12 times. Where query B and C occur 68 and 45 times, which makes the winner: 3, 4, 1, 5 because it covers $$\left(\frac{68 + 45}{134}\right) 100\% \approx 84\%$$ of the total query pool.

Now we can put a maximum of 2 indices, we are lucky: every possible query is covered with just 2 combinations (the ones at the first bullet point).

In reality we would have dozens of indices, thousands of queries, and lots of columns (no more than about 250 though). The chance of every possible query being covered with a limited amount of indexes is estimated to be small.

1

There are 1 best solutions below

1
On BEST ANSWER

Your problem is the weighted version of maximum coverage problem. It is NP-complete problem, the greedy algorithm achieves $1-\frac{1}{e}\approx 63\%$ approximation rate, and it's probably best what you can do in polynomial time (exact algorithm would be exponential in the number of indexes, approximation better than $1-\frac{1}{e}$ is also hard).

First, you have to consider only the permutations that come from some queries. Let $w_q$ be the number of times query $q$ appeared, and set $U_q$ as the set of queries that could be solved by $q$ (i.e. $q$ can be used for it). You want to pick $k$ queries $q_1,\ldots,q_k$ maximizing $\sum_{q \in \bigcup_{i=1}^k U_{q_i}} w_q$.

You can try simplex algorithm and then round the solution to get approximation of integer-programming solution, or you can try the greedy algorithm (at every step pick the query that maximizes the number of requests covered by all picked queries).

I know NP-completeness isn't what you wished for, but nevertheless I hope this helps $\ddot\smile$