
Cross Attention

Implement cross-attention as used in transformer decoders.

In cross-attention, queries come from the decoder and keys/values come from the encoder. This is the mechanism that allows the decoder to attend to encoder outputs.

Given:

  • Q: shape (tgt_len, d_k) — decoder queries
  • K: shape (src_len, d_k) — encoder keys
  • V: shape (src_len, d_k) — encoder values

$$\text{CrossAttn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Note: tgt_len and src_len can be different!

Output: Tensor of shape (tgt_len, d_k).
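The formula above can be sketched directly in NumPy (a minimal reference implementation, not the only way to solve the exercise; `cross_attention` is a name chosen here for illustration):

```python
import numpy as np

def cross_attention(Q, K, V):
    """Scaled dot-product cross-attention.

    Q: (tgt_len, d_k) decoder queries
    K: (src_len, d_k) encoder keys
    V: (src_len, d_k) encoder values
    Returns: (tgt_len, d_k)
    """
    d_k = Q.shape[-1]
    # Similarity of each decoder query to each encoder key: (tgt_len, src_len)
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the src_len axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of encoder values: (tgt_len, d_k)
    return weights @ V
```

Note that `tgt_len` and `src_len` only ever meet inside the `(tgt_len, src_len)` score matrix, which is why they are free to differ:

```python
Q = np.random.randn(3, 4)   # tgt_len=3, d_k=4
K = np.random.randn(5, 4)   # src_len=5
V = np.random.randn(5, 4)
out = cross_attention(Q, K, V)   # shape (3, 4)
```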
