Implement Scaled Dot-Product Attention

Implement the Scaled Dot-Product Attention mechanism from “Attention Is All You Need”.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Input:

  • Q: query tensor of shape (seq_len_q, d_k)
  • K: key tensor of shape (seq_len_k, d_k)
  • V: value tensor of shape (seq_len_k, d_v)

Output: Attention output of shape (seq_len_q, d_v)

Note: $d_k$ is the dimension of the key vectors (last dimension of Q and K).
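The formula above maps directly to a few array operations. As a hedged sketch (function name and the use of NumPy are choices here, not part of the problem statement), one way to implement it, including the standard max-subtraction trick for a numerically stable softmax:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v)
    Returns: (seq_len_q, d_v)
    """
    d_k = Q.shape[-1]
    # Raw attention scores, scaled by sqrt(d_k): (seq_len_q, seq_len_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of value vectors: (seq_len_q, d_v)
    return weights @ V
```

Each row of `weights` sums to 1, so every output row is a convex combination of the rows of V.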
