
Self-Attention Layer

Implement scaled dot-product self-attention.

Given input X of shape (seq_len, d_model):

  1. Project to queries, keys, values: $Q = X \cdot W_Q$, $K = X \cdot W_K$, $V = X \cdot W_V$
  2. Compute attention scores: $\text{scores} = \frac{Q K^T}{\sqrt{d_k}}$
  3. Apply softmax row-wise: $\text{attn} = \text{softmax}(\text{scores})$
  4. Compute output: $\text{out} = \text{attn} \cdot V$

Input:

  • X: input of shape (seq_len, d_model)
  • W_Q, W_K, W_V: projection matrices of shape (d_model, d_k)

Output: Attention output of shape (seq_len, d_k).
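The four steps above can be sketched in NumPy as follows (the function name `self_attention` is illustrative, not part of the problem's required API):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one sequence."""
    Q = X @ W_Q                               # (seq_len, d_k)
    K = X @ W_K                               # (seq_len, d_k)
    V = X @ W_V                               # (seq_len, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len)
    # Row-wise softmax; subtracting the row max improves numerical stability
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    attn = weights / weights.sum(axis=-1, keepdims=True)
    return attn @ V                           # (seq_len, d_k)
```

Each row of `attn` is a probability distribution over positions, so the output row `i` is a convex combination of the value vectors, weighted by how strongly position `i` attends to each position.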
