hard end_to_end

Transformer Decoder Block

Implement a single Transformer decoder block with causal (masked) self-attention and cross-attention.

Architecture (pre-norm):

  1. Causal self-attention + residual (mask future positions with $-\infty$): $X_1 = X + \text{CausalSelfAttn}(\text{LN}(X))$
  2. Cross-attention + residual (attend to the encoder output; Q comes from the decoder, K/V from the encoder): $X_2 = X_1 + \text{CrossAttn}(\text{LN}(X_1), \text{enc\_out})$
  3. FFN + residual: $X_3 = X_2 + \text{FFN}(\text{LN}(X_2))$
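As a building block for step 1, causal self-attention with identity projections might be sketched like this (NumPy; the helper names `softmax` and `causal_self_attn` are my own, not part of the spec):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attn(X):
    # identity projections: Q = K = V = X
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # (tgt_len, tgt_len)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)        # block attention to future positions
    return softmax(scores) @ X
```

With this masking, position 0 can only attend to itself, so its output row equals its input row; cross-attention is the same computation with Q taken from the decoder and K/V from `enc_out`, and no mask.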

Use identity projections for attention (Q, K, and V are the inputs themselves, with no learned projection matrices). Apply LayerNorm over the feature dimension with eps = 1e-5.
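The LayerNorm described above normalizes each row over the feature dimension; a minimal sketch (function name is mine):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize each position over the feature (last) dimension,
    # then apply the learned per-feature scale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```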

Input:

  • X: decoder input shape (tgt_len, d_model)
  • enc_out: encoder output shape (src_len, d_model)
  • gamma1, beta1, gamma2, beta2, gamma3, beta3: LN params (d_model,) each
  • W1: shape (d_model, d_ff), b1: shape (d_ff,), W2: shape (d_ff, d_model), b2: shape (d_model,)

Output: Tensor of shape (tgt_len, d_model).
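Putting the three pre-norm steps together, one possible end-to-end sketch in NumPy (helper names are mine; the FFN activation is not specified in the problem, so ReLU is assumed here):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def attention(Q, K, V, causal=False):
    # identity projections: Q, K, V are used directly
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if causal:
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ V

def decoder_block(X, enc_out, gamma1, beta1, gamma2, beta2, gamma3, beta3,
                  W1, b1, W2, b2):
    # 1. causal self-attention + residual (pre-norm)
    h = layer_norm(X, gamma1, beta1)
    X1 = X + attention(h, h, h, causal=True)
    # 2. cross-attention + residual: Q from decoder, K/V from encoder output
    q = layer_norm(X1, gamma2, beta2)
    X2 = X1 + attention(q, enc_out, enc_out)
    # 3. position-wise FFN + residual (ReLU assumed)
    f = layer_norm(X2, gamma3, beta3)
    X3 = X2 + np.maximum(f @ W1 + b1, 0.0) @ W2 + b2
    return X3
```

Note that `enc_out` may have a different length than `X` (src_len vs tgt_len); only the feature dimension d_model must match, and the cross-attention score matrix has shape (tgt_len, src_len).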

Hints

transformer decoder causal-attention cross-attention