hard
end_to_end
Transformer Decoder Block
Implement a single Transformer decoder block with causal (masked) self-attention and cross-attention.
Architecture (pre-norm):
- Causal Self-Attention + residual (mask future positions with $-\infty$): $X_1 = X + \text{CausalSelfAttn}(\text{LN}(X))$
- Cross-Attention + residual (attend to encoder output): $X_2 = X_1 + \text{CrossAttn}(\text{LN}(X_1), \text{encoder\_out})$, where Q comes from the decoder and K/V come from the encoder
- FFN + residual: $X_3 = X_2 + \text{FFN}(\text{LN}(X_2))$
Use identity projections for attention (Q, K, and V are the inputs themselves, with no learned weight matrices). Use LayerNorm with eps=1e-5.
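The causal masking step can be illustrated with a minimal NumPy sketch. It assumes a single attention head and the usual scaling of scores by $\sqrt{d\_model}$; the problem statement specifies neither. The function name `causal_attention_weights` is hypothetical.

```python
import numpy as np

def causal_attention_weights(x):
    """Attention weights with identity projections (Q = K = x) and a causal mask."""
    tgt_len, d_model = x.shape
    # Assumption: scaled dot-product scores (scaling is not stated in the problem).
    scores = x @ x.T / np.sqrt(d_model)
    # True strictly above the diagonal, i.e. where key position j > query position i.
    future = np.triu(np.ones((tgt_len, tgt_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax over keys; the diagonal is never masked, so each row max is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

x = np.random.default_rng(0).normal(size=(4, 8))
weights = causal_attention_weights(x)
```

Because $\exp(-\infty) = 0$, every masked entry gets exactly zero weight, so `weights` is lower triangular and the first position attends only to itself.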
Input:
- X: decoder input, shape (tgt_len, d_model)
- enc_out: encoder output, shape (src_len, d_model)
- gamma1, beta1, gamma2, beta2, gamma3, beta3: LN params, shape (d_model,) each
- W1: shape (d_model, d_ff); b1; W2: shape (d_ff, d_model); b2
Output: Tensor of shape (tgt_len, d_model).
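One possible way to assemble the three sub-layers is sketched below in NumPy. This is not the reference solution. Several details are assumptions that the problem statement leaves open: a single attention head, scores scaled by $\sqrt{d\_model}$, a ReLU activation inside the FFN, and biases b1 and b2 of shape (d_ff,) and (d_model,). The helper names `layer_norm`, `softmax`, `attention`, and `decoder_block` are hypothetical.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position over the feature axis, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, causal=False):
    # Identity projections: q, k, v are used as-is.
    scores = q @ k.T / np.sqrt(q.shape[-1])  # assumption: scaled dot-product
    if causal:
        # Mask key positions that lie in the future of each query position.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def decoder_block(X, enc_out, gamma1, beta1, gamma2, beta2,
                  gamma3, beta3, W1, b1, W2, b2):
    # 1. Causal self-attention + residual (pre-norm).
    h = layer_norm(X, gamma1, beta1)
    X1 = X + attention(h, h, h, causal=True)
    # 2. Cross-attention + residual: Q from the decoder, K/V from the encoder.
    q = layer_norm(X1, gamma2, beta2)
    X2 = X1 + attention(q, enc_out, enc_out)
    # 3. Position-wise FFN + residual.
    h = layer_norm(X2, gamma3, beta3)
    ffn = np.maximum(h @ W1 + b1, 0.0) @ W2 + b2  # assumption: ReLU activation
    return X2 + ffn

# Usage with random inputs (shapes follow the Input section).
rng = np.random.default_rng(0)
tgt_len, src_len, d_model, d_ff = 5, 7, 8, 16
X = rng.normal(size=(tgt_len, d_model))
enc_out = rng.normal(size=(src_len, d_model))
ones, zeros = np.ones(d_model), np.zeros(d_model)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = decoder_block(X, enc_out, ones, zeros, ones, zeros, ones, zeros,
                    W1, b1, W2, b2)
print(out.shape)  # (5, 8)
```

A useful sanity check for the causal mask: changing the last row of X should leave every earlier row of the output unchanged, because LayerNorm and the FFN act per position and the cross-attention queries depend only on the same position of $X_1$.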
transformer
decoder
causal-attention
cross-attention