
Self-Attention Layer

Implement scaled dot-product self-attention.

Given input X of shape (seq_len, d_model):

  1. Project to queries, keys, values: $Q = X \cdot W_Q$, $K = X \cdot W_K$, $V = X \cdot W_V$
  2. Compute attention scores: $\text{scores} = \frac{Q K^T}{\sqrt{d_k}}$
  3. Apply softmax row-wise: $\text{attn} = \text{softmax}(\text{scores})$
  4. Compute output: $\text{out} = \text{attn} \cdot V$

Input:

  • X: input of shape (seq_len, d_model)
  • W_Q, W_K, W_V: projection matrices of shape (d_model, d_k)

Output: Attention output of shape (seq_len, d_k).
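The four steps above can be sketched in NumPy as follows (the function name `self_attention` is illustrative, not part of the problem's required API):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one sequence."""
    Q = X @ W_Q                               # (seq_len, d_k)
    K = X @ W_K                               # (seq_len, d_k)
    V = X @ W_V                               # (seq_len, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len)
    # Row-wise softmax; subtracting the row max improves numerical stability
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    attn = weights / weights.sum(axis=-1, keepdims=True)
    return attn @ V                           # (seq_len, d_k)
```

Each row of `attn` is a probability distribution over positions, so the output row `i` is a convex combination of the value vectors, weighted by how strongly position `i` attends to each position.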
