I've read multiple papers on machine learning attention mechanisms, but I fail to really understand what is going on at a basic level. I've never seen anything that deeply explains the concept of queries, keys, and values with concrete examples. Below are some examples of papers...
[1] - https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[2] - https://arxiv.org/pdf/1901.05761.pdf
An equation that is in both papers is the one below which is described as dot product attention. How can I really understand what the result of this means for the involved $Q$, $K$, and $V$?...
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
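To make the computation concrete, here is how I currently implement it, a minimal NumPy sketch (the shapes `(n, d_k)` etc. are my reading of the paper, not something it spells out):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention as in [1].

    Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values.
    Returns an (n, d_v) matrix: each row is a weighted average of rows of V.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, m) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, d_v)
```

So mechanically each output row is a convex combination of value rows, weighted by how strongly the query matched each key; it's the meaning of that combination I'm unsure about.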
There is also another variant, which they call Laplace attention, defined as...
$$ \mathrm{Laplace}(Q, K, V) = WV \in \mathbb{R}^{n\times d_k}, \;\;\;\;\; W_i = \mathrm{softmax}\left((-\| Q_i - K_j \|_1)_{j=1}^n\right) \in \mathbb{R}^n $$
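My reading of this variant is that the dot product is replaced by a negative L1 distance before the softmax. A sketch of that interpretation (again, the shapes are my assumption):

```python
import numpy as np

def laplace_attention(Q, K, V):
    """Laplace attention as I understand it from [2].

    Q: (m, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values.
    Returns an (m, d_v) matrix of weighted averages of value rows.
    """
    # Pairwise negative L1 distances: entry (i, j) = -||Q_i - K_j||_1
    scores = -np.abs(Q[:, None, :] - K[None, :, :]).sum(axis=-1)  # (m, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                # softmax over keys
    return weights @ V
```

So a query attends most to the keys it is *closest* to in L1 distance, rather than the ones it has the largest inner product with.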
I understand all of the steps involved, but I don't understand what the end result of these operations means, and I feel this gap is interfering with my ability to grasp papers that deal with attention.
This is quite hand-wavy, but hopefully explains the rough idea.
The way to think about dot-product attention is to imagine that $Q$ asks questions about the context of a word, and has an orthonormal basis corresponding to the base questions: $e_1$: ''am I about water?'', $e_2$: ''am I about financial services?'', $\dots$.

$K$ has a corresponding orthonormal basis $\{f_j\}$ such that $\langle e_i, f_j\rangle=\begin{cases}1&\text{if }i=j,\\0&\text{otherwise,}\end{cases}$ essentially responding in the affirmative: $f_1$: ''you are about water'', $f_2$: ''you are about financial services'', etc.

$V$ is then the vector we assign to the word once we have found its context. So, for example, the first element of an orthonormal basis for the value space may correspond to the vector ''bank'' gets when the previous word ''river'' has made us decide it's a bank to do with water, not money.
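The hand-wavy story above can be made numeric. Below is a toy sketch (all vectors and the two-dimensional ''topic'' basis are invented for illustration, not taken from either paper): the query for ''bank'' leans towards the water axis because ''river'' preceded it, the keys answer along the same axes, and the output is a blend of the two candidate senses, pulled towards the water one.

```python
import numpy as np

# Hypothetical 2-d basis: axis 0 = "am I about water?", axis 1 = "am I about money?"
q_bank = np.array([[3.0, 0.5]])    # "bank" after "river": mostly asking the water question

# Keys answer in the same basis, one context row each.
K = np.array([[1.0, 0.0],          # "river": "you are about water"
              [0.0, 1.0]])         # "loan":  "you are about money"

# Values: the vector each context would assign to "bank".
V = np.array([[0.9, 0.1],          # water-sense embedding of "bank"
              [0.1, 0.9]])         # money-sense embedding of "bank"

scores = q_bank @ K.T / np.sqrt(K.shape[1])
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
bank_vec = weights @ V             # blended representation, closer to the water sense
```

The point is that the output is not a hard choice of one value row; it is a weighted average whose weights record how strongly each question was answered.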