Implement Scaled Dot-Product Attention

Implement the Scaled Dot-Product Attention mechanism from “Attention Is All You Need”.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Input:

  • Q: query tensor of shape (seq_len_q, d_k)
  • K: key tensor of shape (seq_len_k, d_k)
  • V: value tensor of shape (seq_len_k, d_v)

Output: Attention output of shape (seq_len_q, d_v)

Note: $d_k$ is the dimension of the key vectors (last dimension of Q and K).
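The formula above maps directly to a few array operations. As a hedged sketch (function name and the use of NumPy are choices here, not part of the problem statement), one way to implement it, including the standard max-subtraction trick for a numerically stable softmax:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v)
    Returns: (seq_len_q, d_v)
    """
    d_k = Q.shape[-1]
    # Raw attention scores, scaled by sqrt(d_k): (seq_len_q, seq_len_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of value vectors: (seq_len_q, d_v)
    return weights @ V
```

Each row of `weights` sums to 1, so every output row is a convex combination of the rows of V.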
