
Transformer Encoder Block

Implement a single Transformer encoder block.

Architecture (pre-norm variant for simplicity):

  1. Self-Attention with residual: $X_1 = X + \text{SelfAttn}(\text{LayerNorm}(X))$
  2. FFN with residual: $X_2 = X_1 + \text{FFN}(\text{LayerNorm}(X_1))$

Where LayerNorm normalizes each row to zero mean and unit variance (with eps=1e-5), then scales by gamma and shifts by beta.
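This LayerNorm can be sketched in NumPy as follows (the function name and signature are illustrative, not prescribed by the problem):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row to zero mean and unit variance,
    # then scale by gamma and shift by beta.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```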

Self-attention: $Q=K=V=\text{normed\_X}$, scores = $QK^T / \sqrt{d}$ (where $d$ is d_model), output = softmax(scores) @ V, with softmax applied row-wise.
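A minimal sketch of this attention step, assuming the identity projections stated below (so Q, K, and V are all the normed input):

```python
import numpy as np

def self_attention(x):
    # Identity projections: Q = K = V = x.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (seq_len, seq_len)
    # Row-wise softmax; subtracting the row max improves numerical stability.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                     # (seq_len, d_model)
```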

FFN: $\text{ReLU}(x \cdot W_1 + b_1) \cdot W_2 + b_2$
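The FFN is a two-layer MLP with a ReLU in between; a direct NumPy sketch (function name is illustrative):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # ReLU(x @ W1 + b1) @ W2 + b2
    hidden = np.maximum(x @ W1 + b1, 0.0)  # (seq_len, d_ff)
    return hidden @ W2 + b2                # (seq_len, d_model)
```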

For simplicity, use identity projections (W_Q=W_K=W_V=I) for attention.

Input:

  • X: shape (seq_len, d_model)
  • gamma1, beta1: LayerNorm params for attention, shape (d_model,)
  • gamma2, beta2: LayerNorm params for FFN, shape (d_model,)
  • W1: shape (d_model, d_ff), b1: shape (d_ff,)
  • W2: shape (d_ff, d_model), b2: shape (d_model,)

Output: Tensor of shape (seq_len, d_model).
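Putting the pieces together, one possible end-to-end sketch of the pre-norm block (helper names are illustrative; the two residual additions follow steps 1 and 2 above):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Row-wise normalization, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def self_attention(x):
    # Identity projections (W_Q = W_K = W_V = I), so Q = K = V = x.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def ffn(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def encoder_block(X, gamma1, beta1, gamma2, beta2, W1, b1, W2, b2):
    # Pre-norm: normalize first, apply the sublayer, then add the residual.
    X1 = X + self_attention(layer_norm(X, gamma1, beta1))
    X2 = X1 + ffn(layer_norm(X1, gamma2, beta2), W1, b1, W2, b2)
    return X2
```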
