Implement the Scaled Dot-Product Attention mechanism from “Attention Is All You Need”.
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Input:
- Q: query tensor of shape (seq_len_q, d_k)
- K: key tensor of shape (seq_len_k, d_k)
- V: value tensor of shape (seq_len_k, d_v)
Output: Attention output of shape (seq_len_q, d_v)
Note: $d_k$ is the dimension of the key vectors (last dimension of Q and K).
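A minimal NumPy sketch of the formula above (the function name and use of NumPy are illustrative, not required by the task). It computes the scaled scores $QK^T/\sqrt{d_k}$, applies a numerically stable softmax over the key axis, and multiplies by V:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V.

    Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len_q, seq_len_k)
    # Subtract the row max before exponentiating for numerical stability;
    # softmax is invariant to this shift.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # (seq_len_q, d_v)
```

Note that when all keys are identical, every attention weight is $1/\text{seq\_len}_k$, so the output reduces to the mean of the value rows.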