Implement a single Transformer encoder block.
Architecture (pre-norm variant for simplicity):
$h = X + \text{SelfAttention}(\text{LayerNorm}(X; \gamma_1, \beta_1))$
$\text{out} = h + \text{FFN}(\text{LayerNorm}(h; \gamma_2, \beta_2))$
Where LayerNorm normalizes each row to zero mean and unit variance (with eps=1e-5),
then scales by gamma and shifts by beta.
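As a minimal NumPy sketch of the LayerNorm described above (the function name `layer_norm` is mine, not part of the spec):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row to zero mean and unit variance, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```

With `gamma = 1` and `beta = 0`, every output row has (approximately) zero mean and unit variance.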
Self-attention: $Q=K=V=\text{normed\_X}$, $\text{scores} = QK^\top / \sqrt{d}$, $\text{output} = \text{softmax}(\text{scores})\,V$
FFN: $\text{ReLU}(x \cdot W_1 + b_1) \cdot W_2 + b_2$
For simplicity, use identity projections (W_Q=W_K=W_V=I) for attention.
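A sketch of the simplified attention under the identity-projection assumption above (the helper name `self_attention` and the max-subtraction trick for numerical stability are my additions):

```python
import numpy as np

def self_attention(x):
    # Identity projections per the simplification: Q = K = V = x.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    # Row-wise softmax; subtracting the row max avoids overflow in exp.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x
```

Each row of `weights` sums to 1, so the output is a convex combination of the input rows.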
Input:
X: shape (seq_len, d_model)
gamma1, beta1: LayerNorm params for attention, shape (d_model,)
gamma2, beta2: LayerNorm params for FFN, shape (d_model,)
W1: shape (d_model, d_ff), b1: shape (d_ff,)
W2: shape (d_ff, d_model), b2: shape (d_model,)
Output: Tensor of shape (seq_len, d_model).
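Putting the pieces together, one possible NumPy sketch of the full block under the stated simplifications (helper names `layer_norm`, `self_attention`, and `transformer_block` are my own; the pre-norm residual wiring is assumed from the spec):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Per-row normalization to zero mean, unit variance, then scale/shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def self_attention(x):
    # Identity projections: Q = K = V = x, as the spec allows.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def transformer_block(X, gamma1, beta1, gamma2, beta2, W1, b1, W2, b2):
    # Pre-norm: LayerNorm inside each residual branch.
    h = X + self_attention(layer_norm(X, gamma1, beta1))
    ffn = np.maximum(0.0, layer_norm(h, gamma2, beta2) @ W1 + b1) @ W2 + b2
    return h + ffn
```

The output shape matches the input, (seq_len, d_model), since both residual branches preserve shape.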