
Transformer Encoder Block

Implement a single Transformer encoder block.

Architecture (pre-norm variant for simplicity):

  1. Self-Attention with residual: $X_1 = X + \text{SelfAttn}(\text{LayerNorm}(X))$
  2. FFN with residual: $X_2 = X_1 + \text{FFN}(\text{LayerNorm}(X_1))$

Where LayerNorm normalizes each row to zero mean and unit variance (with eps=1e-5), then scales by gamma and shifts by beta.
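This LayerNorm can be sketched in NumPy as follows (the function name and signature are illustrative, not prescribed by the problem):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row to zero mean and unit variance,
    # then scale by gamma and shift by beta.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```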

Self-attention: $Q=K=V=\text{normed\_X}$, scores = $QK^T / \sqrt{d}$ (where $d$ is d_model), output = softmax(scores) @ V, with softmax applied row-wise.
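A minimal sketch of this attention step, assuming the identity projections stated below (so Q, K, and V are all the normed input):

```python
import numpy as np

def self_attention(x):
    # Identity projections: Q = K = V = x.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (seq_len, seq_len)
    # Row-wise softmax; subtracting the row max improves numerical stability.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                     # (seq_len, d_model)
```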

FFN: $\text{ReLU}(x \cdot W_1 + b_1) \cdot W_2 + b_2$
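The FFN is a two-layer MLP with a ReLU in between; a direct NumPy sketch (function name is illustrative):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # ReLU(x @ W1 + b1) @ W2 + b2
    hidden = np.maximum(x @ W1 + b1, 0.0)  # (seq_len, d_ff)
    return hidden @ W2 + b2                # (seq_len, d_model)
```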

For simplicity, use identity projections (W_Q=W_K=W_V=I) for attention.

Input:

  • X: shape (seq_len, d_model)
  • gamma1, beta1: LayerNorm params for attention, shape (d_model,)
  • gamma2, beta2: LayerNorm params for FFN, shape (d_model,)
  • W1: shape (d_model, d_ff), b1: shape (d_ff,)
  • W2: shape (d_ff, d_model), b2: shape (d_model,)

Output: Tensor of shape (seq_len, d_model).
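Putting the pieces together, one possible end-to-end sketch of the pre-norm block (helper names are illustrative; the two residual additions follow steps 1 and 2 above):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Row-wise normalization, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def self_attention(x):
    # Identity projections (W_Q = W_K = W_V = I), so Q = K = V = x.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def ffn(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def encoder_block(X, gamma1, beta1, gamma2, beta2, W1, b1, W2, b2):
    # Pre-norm: normalize first, apply the sublayer, then add the residual.
    X1 = X + self_attention(layer_norm(X, gamma1, beta1))
    X2 = X1 + ffn(layer_norm(X1, gamma2, beta2), W1, b1, W2, b2)
    return X2
```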
