Do you know any layered shortest path discovery algorithm?


I am working on an instant search engine that is aware of the positions of the queried words. This matters because one of its main features is to answer with documents containing the query words at the best possible proximity (i.e. with a pattern that looks like the query).

The engine is meant to answer in less than 50ms.

I would like to find the shortest path that moves from the first query word to the last, in order. Each word appears at given positions across the whole dataset, and we group the words by the positions they appear at.

For example, "william" appears at 7 different positions among all documents, "cowboy" at 5 positions, and "the" is a very popular word and therefore appears at many different positions.

william   0 12 13 14 16 18 32
the       1  2  5  6  8  9 11 ...
cowboy    2  5 21 23 32

When a user searches for "william the cowboy", we want to return the documents containing those words, ordered by path proximity (a.k.a. path length). Proximity is computed from the order of the query words and the positions at which they appear in the documents.

For example, the best proximity that can exist is "william" in position 0 followed by "the" in 1 and "cowboy" in 2, which gives a proximity of (1 - 0) + (2 - 1) = 2. But as you probably already guessed, we don't know whether documents actually exist with the given words at these positions; we must check that.
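Under this definition, proximity is just the sum of the gaps between consecutive query-word positions along a path. A minimal sketch (the function name is illustrative):

```python
def proximity(path):
    # Proximity = sum of position gaps between consecutive query words,
    # so the best possible path for an n-word query has proximity n - 1.
    return sum(b - a for a, b in zip(path, path[1:]))

print(proximity([0, 1, 2]))   # best possible path for a 3-word query -> 2
print(proximity([0, 5, 21]))  # a much worse path -> 21
```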

This is why we also store the documents ids for each word and position.

william in 0   156 674 764 793 798 801
the     in 1   112 156 256 325 421 674 798 801 ...
cowboy  in 2   112 201 674 798 801

If we follow the path 0-1-2 we find out that there are only 3 documents containing those words at those exact positions: documents 674, 798 and 801. We obtain this by intersecting the document ids of the words along the shortest path found.
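That check is a plain set intersection. A minimal sketch using the document ids from the tables above (the elided ids of "the" are left out):

```python
# Document ids per (word, position), copied from the tables above.
docs = {
    ("william", 0): {156, 674, 764, 793, 798, 801},
    ("the", 1): {112, 156, 256, 325, 421, 674, 798, 801},
    ("cowboy", 2): {112, 201, 674, 798, 801},
}

# Intersect along the candidate path 0-1-2.
result = docs[("william", 0)] & docs[("the", 1)] & docs[("cowboy", 2)]
print(sorted(result))  # [674, 798, 801]
```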

We must then find the next shortest path, because there aren't enough documents to return to the user. We would also like to combine multiple criteria, not just the global proximity of the path but also the words that compose it (to account for the number of typos along this path).

Current implementation

The A* bag

We chose to use an A* to find the shortest path between word positions. A node represents a move from one layer (e.g. "the" in 1) to the next, along with the parent position (e.g. "william" in 0) and the accumulated proximity. Only successors that produce a valid path (i.e. one that contains documents) are explored.

To detect valid and invalid paths, we compute the intersection between the parent position and the position we are testing at the current layer. We obviously cache most of this work for every request.

  • A* gives us all of the shortest paths with the same proximity.
  • We retrieve the documents corresponding to these paths.
  • If there aren't enough documents, we rerun the first step with a bigger proximity.
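The layered best-first search described above can be sketched as follows. This is a simplified version of the idea, not the engine's actual code: it pops the cheapest partial path, prunes successors whose document intersection is empty, and collects complete paths until enough documents are found (the data at the bottom is synthetic):

```python
import heapq

def best_paths(layers, docs, min_docs):
    """Best-first search over position layers.

    layers: list of position lists, one per query word (in order).
    docs:   dict mapping (layer_index, position) -> set of document ids.
    Returns the cheapest complete paths (by summed proximity) whose
    document intersection is non-empty, until min_docs documents are found.
    """
    # Each heap entry: (cost so far, layer index, path, doc-id intersection).
    heap = [(0, 0, (pos,), docs[(0, pos)]) for pos in layers[0]]
    heapq.heapify(heap)
    found, results = set(), []
    while heap and len(found) < min_docs:
        cost, i, path, ids = heapq.heappop(heap)
        if i == len(layers) - 1:  # complete path: one position per query word
            results.append((cost, path, ids))
            found |= ids
            continue
        for pos in layers[i + 1]:
            nxt = ids & docs[(i + 1, pos)]  # prune successors with no documents
            if nxt:
                heapq.heappush(
                    heap, (cost + abs(pos - path[-1]), i + 1, path + (pos,), nxt)
                )
    return results

# Synthetic example: two candidate positions per layer.
layers = [[0, 12], [1, 2], [2, 5]]
docs = {
    (0, 0): {156, 674, 798, 801}, (0, 12): {42},
    (1, 1): {156, 674, 798, 801}, (1, 2): {42},
    (2, 2): {674, 798, 801}, (2, 5): {42},
}
results = best_paths(layers, docs, min_docs=3)
print(results[0])  # (2, (0, 1, 2), {674, 798, 801})
```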

This solution is currently the best one we have found, but as A* is a deterministic, exhaustive algorithm, some requests make it consume so much memory that it gets killed by the OOM killer, or take so much time that the user is probably gone already.

"Bag" means that the A* returns the N shortest paths, not just a single shortest one.

The Ant Colony Optimization

We also tried to use an Ant Colony System to find the shortest paths.

The advantage of this solution is that it is an evolutionary algorithm, meaning the solution it gives should improve over time; it can therefore be run for as long as the query's 50ms budget allows.

However, there are some problems with it too:

  • most of the time, the A* finds better results than the Ant Colony.
  • the Ant System parameters (⍺, β...) must be fine-tuned to find relevant results.

Note that I haven't finished the Ant Colony Optimization solution yet, but I am sure there is a better algorithm for what I am trying to achieve. Maybe even just a MAX-MIN Ant System would be enough.
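Since the colony isn't finished, here is only a toy MAX-MIN-style sketch over layers to show the shape of what I mean (all parameter names and values are illustrative, not tuned): ants build a path layer by layer, shorter gaps get higher heuristic desirability, only the iteration-best ant deposits pheromone, and pheromone is clamped to [tau_min, tau_max] as in MAX-MIN Ant System.

```python
import random

def ant_colony(layers, n_ants=20, n_iter=50, alpha=1.0, beta=2.0,
               evap=0.1, tau_min=0.01, tau_max=1.0, seed=0):
    """Toy MAX-MIN-style ant system over position layers."""
    rng = random.Random(seed)
    # pheromone[i][pos]: desirability of picking `pos` in layer i.
    pheromone = [{p: tau_max for p in layer} for layer in layers]
    best_path, best_cost = None, float("inf")
    for _ in range(n_iter):
        iter_best, iter_cost = None, float("inf")
        for _ in range(n_ants):
            path = [rng.choice(layers[0])]
            for i in range(1, len(layers)):
                # Roulette-wheel choice weighted by pheromone and 1/gap.
                weights = []
                for p in layers[i]:
                    gap = abs(p - path[-1]) + 1  # +1 avoids division by zero
                    weights.append(pheromone[i][p] ** alpha * (1.0 / gap) ** beta)
                r, acc = rng.random() * sum(weights), 0.0
                for p, w in zip(layers[i], weights):
                    acc += w
                    if acc >= r:
                        path.append(p)
                        break
            cost = sum(abs(b - a) for a, b in zip(path, path[1:]))
            if cost < iter_cost:
                iter_best, iter_cost = path, cost
        # Evaporate everywhere, then let only the iteration-best ant deposit.
        for layer in pheromone:
            for p in layer:
                layer[p] = max(tau_min, layer[p] * (1 - evap))
        for i, p in enumerate(iter_best):
            pheromone[i][p] = min(tau_max, pheromone[i][p] + 1.0 / (1 + iter_cost))
        if iter_cost < best_cost:
            best_path, best_cost = iter_best, iter_cost
    return best_path, best_cost

path, cost = ant_colony([[0, 12], [1, 9], [2, 21]])
print(path, cost)  # the optimum here is [0, 1, 2] with cost 2
```

Note that this sketch does not yet prune paths with empty document intersections, which the real fitness function would have to do.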

I would also like to try a multi-criteria Ant Colony Optimization, to take into account more parameters than just the proximity of the path (e.g. the number of typos).

One thing to note is that, for both algorithms, most of the time is spent intersecting document ids, so I concluded that I should probably find a way to reduce the number of intersections. Maybe by packing document ids by ranges of positions: we union the ids of "william" from position 0 to position 10, so that when we want to intersect "the" in 1 with "william" in 0 or 5, we can use the precomputed union first.
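The range-union idea could look like this (the bucket size, names and data are illustrative only): a cheap pre-check against the bucket union rules out whole position ranges before paying for the exact per-position intersections.

```python
from collections import defaultdict

BUCKET = 10  # positions 0-9, 10-19, ... share one precomputed union

def build_buckets(postings):
    """postings: dict position -> set of doc ids, for one word.
    Returns dict bucket_index -> union of all doc ids in that bucket."""
    buckets = defaultdict(set)
    for pos, ids in postings.items():
        buckets[pos // BUCKET] |= ids
    return dict(buckets)

# Illustrative postings for two words.
william = {0: {156, 674}, 5: {793, 798}, 12: {801}}
the = {1: {156, 674, 798}}
wb = build_buckets(william)

# Cheap pre-check: "the" in 1 against the union of william's 0-9 bucket.
candidates = the[1] & wb[0]
print(sorted(candidates))  # [156, 674, 798]

# Only if the pre-check is non-empty do we pay for the exact intersections.
if candidates:
    exact = {pos: the[1] & ids
             for pos, ids in william.items() if pos // BUCKET == 0}
    print(exact)  # {0: {156, 674}, 5: {798}}
```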

Do you know of any other algorithm that could better solve my problem?