Implement cross-attention as used in transformer decoders.
In cross-attention, queries come from the decoder while keys and values come from the encoder. This is the mechanism that lets each decoder position attend to the encoder's outputs.
Given:

- Q: shape (tgt_len, d_k) — decoder queries
- K: shape (src_len, d_k) — encoder keys
- V: shape (src_len, d_k) — encoder values

$$\text{CrossAttn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Note: tgt_len and src_len can be different!
Output: Tensor of shape (tgt_len, d_k).
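A minimal NumPy sketch of the formula above (the function name `cross_attention` is illustrative, not a reference implementation). The scores matrix has shape (tgt_len, src_len), which is why the two lengths may differ; the softmax is applied row-wise over the src_len axis.

```python
import numpy as np

def cross_attention(Q, K, V):
    """Cross-attention: decoder queries attend to encoder keys/values.

    Q: (tgt_len, d_k), K: (src_len, d_k), V: (src_len, d_k)
    Returns: (tgt_len, d_k)
    """
    d_k = Q.shape[-1]
    # Scaled dot-product scores: (tgt_len, src_len)
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax over the src_len axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Convex combination of encoder values: (tgt_len, d_k)
    return weights @ V

# Example with tgt_len=3, src_len=5, d_k=4 (shapes chosen arbitrarily)
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 4))
out = cross_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Note that tgt_len and src_len never need to match: only d_k must agree between Q and K, and K and V must share src_len.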