hard
end_to_end
Transformer Encoder Block
Implement a single Transformer encoder block.
Architecture (pre-norm variant for simplicity):
- Self-Attention with residual: $X_1 = X + \text{SelfAttn}(\text{LayerNorm}(X))$
- FFN with residual: $X_2 = X_1 + \text{FFN}(\text{LayerNorm}(X_1))$
Where LayerNorm normalizes each row to zero mean and unit variance (with eps=1e-5),
then scales by gamma and shifts by beta.
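The LayerNorm step described above can be sketched in NumPy as follows. This is a minimal illustration, not the reference solution: the function name `layer_norm` is hypothetical, and it assumes the biased (population) variance and that eps is added inside the square root, which is the usual convention but is not spelled out in the problem statement.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row of x, then scale by gamma and shift by beta."""
    mean = x.mean(axis=-1, keepdims=True)
    # Assumption: population variance (ddof=0), with eps added under the root.
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

# Toy input: the second row is constant, so it normalizes to all zeros.
x = np.array([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]])
out = layer_norm(x, gamma=np.ones(3), beta=np.zeros(3))
```

With `gamma` set to ones and `beta` to zeros, each row of `out` has approximately zero mean. The eps term keeps the constant row from dividing by zero.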
Self-attention: $Q = K = V = \text{normed\_X}$, $\text{scores} = QK^T / \sqrt{d}$, $\text{output} = \text{softmax}(\text{scores})\,V$
FFN: $\text{ReLU}(x \cdot W_1 + b_1) \cdot W_2 + b_2$
For simplicity, use identity projections (W_Q=W_K=W_V=I) for attention.
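With identity projections, the attention step reduces to a few lines. The sketch below is one possible reading of the description, under two assumptions the problem does not state outright: the softmax is applied row-wise (over keys), and $d$ is `d_model`. The helper names `softmax` and `self_attention` are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(normed_x):
    """Scaled dot-product attention with Q = K = V = normed_x."""
    d = normed_x.shape[-1]  # assumption: d is d_model
    scores = normed_x @ normed_x.T / np.sqrt(d)  # shape (seq_len, seq_len)
    return softmax(scores) @ normed_x            # shape (seq_len, d_model)

normed_x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
attn_out = self_attention(normed_x)
```

Each output row is a convex combination of the input rows, so with this toy input every entry of `attn_out` lies between 0 and 1.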
Input:
- X: shape (seq_len, d_model)
- gamma1, beta1: LayerNorm params for attention, shape (d_model,)
- gamma2, beta2: LayerNorm params for FFN, shape (d_model,)
- W1: shape (d_model, d_ff), b1: shape (d_ff,)
- W2: shape (d_ff, d_model), b2: shape (d_model,)
Output: Tensor of shape (seq_len, d_model).
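Putting the pieces together, the whole block might look like the sketch below. It is an illustration under stated assumptions rather than the official solution: the problem does not name a tensor framework, so NumPy arrays stand in for tensors, and the function name `encoder_block`, the argument order, and the random test data are all hypothetical. It also assumes population variance in LayerNorm, row-wise softmax, and $d$ equal to `d_model`.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Assumption: population variance, eps added under the square root.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def softmax(scores):
    # Row-wise softmax with max subtraction for numerical stability.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def encoder_block(X, gamma1, beta1, gamma2, beta2, W1, b1, W2, b2):
    """Pre-norm encoder block with identity attention projections."""
    d_model = X.shape[-1]
    # Attention sub-layer: X1 = X + SelfAttn(LayerNorm(X))
    normed = layer_norm(X, gamma1, beta1)
    attn = softmax(normed @ normed.T / np.sqrt(d_model)) @ normed
    X1 = X + attn
    # FFN sub-layer: X2 = X1 + FFN(LayerNorm(X1))
    normed2 = layer_norm(X1, gamma2, beta2)
    ffn = np.maximum(normed2 @ W1 + b1, 0.0) @ W2 + b2  # ReLU between layers
    return X1 + ffn

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16
X = rng.normal(size=(seq_len, d_model))
gamma1, beta1 = np.ones(d_model), np.zeros(d_model)
gamma2, beta2 = np.ones(d_model), np.zeros(d_model)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = encoder_block(X, gamma1, beta1, gamma2, beta2, W1, b1, W2, b2)
print(out.shape)  # (4, 8)
```

A quick sanity check on the residual paths: if `gamma1` and `beta1` are zero the attention branch contributes nothing, and if `W2` and `b2` are also zero the FFN branch vanishes too, so the block returns `X` unchanged.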
transformer
encoder
self-attention
layer-norm
ffn