What is dot product attention actually doing?


I've read multiple papers on machine learning attention mechanisms, but I fail to really understand what is going on at a basic level. I've never seen anything that deeply explains the concept of queries, keys, and values with real examples. Below are some examples of papers...

[1] - https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

[2] - https://arxiv.org/pdf/1901.05761.pdf

An equation that is in both papers is the one below which is described as dot product attention. How can I really understand what the result of this means for the involved $Q$, $K$, and $V$?...

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
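To make the formula concrete, here is a minimal NumPy sketch of it (the shapes and random toy data are my own choices for illustration): each row of the softmax output is a set of weights over the keys, and the result is the correspondingly weighted average of the value rows.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n_q, n_k); each row sums to 1
    return weights @ V                         # weighted average of the rows of V

# toy example: 2 queries attending over 3 key/value pairs of dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = dot_product_attention(Q, K, V)
print(out.shape)  # one mixed value vector per query: (2, 4)
```

So the end result is, per query, a convex combination of the value vectors, with the mixing weights determined by how well the query matches each key.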

There is also another variant, which they call Laplace attention, defined as:

$$ \mathrm{Laplace}(Q, K, V) = WV \in \mathbb{R}^{n\times d_k}, \;\;\;\;\; W_i = \mathrm{softmax}\left(\left(-\lVert Q_i - K_j \rVert_1\right)_{j=1}^n\right) \in \mathbb{R}^n $$
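Reading the formula as: the score between query $i$ and key $j$ is the negated $L_1$ distance between them (an interpretation, since the paper's indexing is terse), a small NumPy sketch looks like this:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def laplace_attention(Q, K, V):
    # scores[i, j] = -||Q_i - K_j||_1: closer keys get higher scores
    scores = -np.abs(Q[:, None, :] - K[None, :, :]).sum(axis=-1)
    W = softmax(scores, axis=-1)  # row i of W is the W_i in the formula
    return W @ V

Q = np.array([[0.0, 1.0], [2.0, 0.0]])
K = np.array([[0.0, 1.0], [5.0, 5.0]])  # first key equals the first query
V = np.array([[1.0, 0.0], [0.0, 1.0]])
out = laplace_attention(Q, K, V)
# the first query matches the first key exactly, so its output is pulled
# almost entirely toward V[0]
```

The structure is the same as dot-product attention; only the similarity measure changes, from an inner product to a (negated) $L_1$ distance.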

I understand all of the steps involved, but I don't understand what the end result of these formulas means, and I feel like that gap is interfering with my ability to grasp papers that deal with attention.

There are 2 answers below.

Answer 1
This is quite hand-wavy, but hopefully explains the rough idea.

The way to think about dot-product attention is to imagine that $Q$ is asking questions about the context of a word, and has an orthonormal basis corresponding to the base questions: $e_1$: ''Am I about water?'', $e_2$: ''Am I about financial services?'', $\dots$. $K$ has a corresponding orthonormal basis $f_1$: ''You are about water'', $f_2$: ''You are about financial services'', etc., such that $\langle e_i, f_j\rangle=\begin{cases}1 &\text{ if }i=j,\\0 &\text{ otherwise,}\end{cases}$ so each key essentially responds in the affirmative to exactly one question.

$V$ is then the vector we assign to the word once we have found its context. So, for example, the first element of an orthonormal basis for the value space may correspond to the value ''bank'' gets when the previous word ''river'' has made us decide it's a bank to do with water, not money.
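The hand-wavy picture above can be made concrete with a tiny numerical sketch (the two ''topic'' directions and all the numbers here are invented for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# two 'topic' directions: axis 0 = water, axis 1 = finance
q_bank_after_river = np.array([1.0, 0.0])  # query: "am I about water?"
keys = np.array([[1.0, 0.0],               # key f1: "you are about water"
                 [0.0, 1.0]])              # key f2: "you are about finance"
values = np.array([[0.9, 0.1],            # value if "bank" means river bank
                   [0.1, 0.9]])           # value if "bank" means money bank
w = softmax(keys @ q_bank_after_river)     # inner products <q, f_j>, normalized
out = w @ values
# w puts more weight on the water key, so out lands closer to the
# river-bank value vector than to the money-bank one
```

The inner products pick out which key ''answers'' the query, and the output is the value vector for that answer (blurred by the softmax rather than selected exactly).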

Answer 2

In each case you've listed, the aim is to take a "query" $Q$, which is a discretization of a question, and determine an answer to it. The question is discretized into keys $K$, which are associated with values $V$. Everything after that is some way of evaluating the question.

So, suppose we are training a neural net to identify cars. The question that this neural net sets out to answer is "Is this a car?" We can discretize this, for example, by identifying key features:

  • headlights
  • windscreen wipers
  • less than 3 wheels
  • 3 wheels
  • 4 wheels
  • more than 4 wheels
  • radiator grill
  • handlebars

and there may be more, but for this example we'll stop here with 8 features. The idea of attention is that our matrix $Q$ tells us which features correlate strongly with others, i.e. what we should also pay attention to when considering a particular feature. So when we consider handlebars (last row and last column of $Q$) we would pay much more attention to "less than 3 wheels" than to any of the other wheel-features, and we would get a vector (for that row/column) that looks like: $$Q_8 = [0,0,1,0,0,0,0,1]$$

I'm going to propose that the full matrix $Q$ will look like this, and you should consider how much you (dis-)agree:

$$Q = \left[ \begin{array}{cccccccc} 1 & 0 & 0.5 & 1 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ 0.5 & 0 & 0.5 & 0 & 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \end{array} \right] $$

The key vector has been trained on cars with varying degrees of damage (broken headlights, missing windscreen wipers, etc.) and so might look like $$K = [0.8, 0.66, 0.07, 0.35, 0.96, 0.05, 0.75, 0.00]$$

and our values $V$, associated with the key vector, tell us how strongly these keys weigh in our evaluation of whether this is a car or not. For the purpose of example, let's set those values as $$V=[0.6, 0.9, 0.05, 0.25, 0.95, 0.01, 0.5, 0.01]$$
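One way to push these particular numbers through the dot-product formula (treating $K$ and $V$ as single vectors, so the result collapses to one ''car-ness'' score; that interpretation is my own, as the answer doesn't specify the shapes):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# the correlation matrix Q, key vector K, and value vector V from the text
Q = np.array([
    [1.0, 0.0, 0.5, 1.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
K = np.array([0.8, 0.66, 0.07, 0.35, 0.96, 0.05, 0.75, 0.00])
V = np.array([0.6, 0.9, 0.05, 0.25, 0.95, 0.01, 0.5, 0.01])

d_k = len(K)
scores = Q @ K / np.sqrt(d_k)  # how strongly each feature row matches the keys
weights = softmax(scores)      # attention distribution over the 8 features
car_score = weights @ V        # weighted "is this a car?" evidence
print(round(float(car_score), 3))
```

Features whose correlated neighbours were detected strongly (headlights, wheels, grill) dominate the softmax, so the final score is pulled toward their values; the handlebars row, which matches the key vector poorly, contributes almost nothing.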

The particular formula you use will then vary depending on:

  • computational resources -- a dot product can be computed very efficiently on computer hardware, but may (as noted in the Google paper) behave poorly without some kind of scaling factor when the dimension is large
  • the particular problem being addressed -- identifying a car is a different kind of problem from machine translation, and a different $\mathrm{Attention}()$ formula might work better for one than the other
  • your personal preferences and your ability to analyse the mathematics that results.