Implement a single Transformer decoder block with causal (masked) self-attention and cross-attention.
Architecture (pre-norm): each sublayer normalizes its input first, then adds a residual connection:
1. Masked self-attention: X = X + Attn(LN1(X)), with a causal mask so position i attends only to positions <= i.
2. Cross-attention: X = X + Attn(LN2(X), enc_out), with queries from the decoder and keys/values from the encoder output.
3. Feed-forward: X = X + FFN(LN3(X)), where FFN(h) = activation(h W1 + b1) W2 + b2.
Use identity projections for attention (no learned Q/K/V/output weights); scale attention scores by sqrt(d_model). LayerNorm with eps=1e-5.
Input:
- X: decoder input, shape (tgt_len, d_model)
- enc_out: encoder output, shape (src_len, d_model)
- gamma1, beta1, gamma2, beta2, gamma3, beta3: LayerNorm params, shape (d_model,) each
- W1: shape (d_model, d_ff), b1: shape (d_ff,)
- W2: shape (d_ff, d_model), b2: shape (d_model,)
Output: Tensor of shape (tgt_len, d_model).
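A minimal NumPy sketch of one possible solution. Two assumptions not fixed by the spec: ReLU as the feed-forward activation, and masking with -inf before the softmax; names like `decoder_block` are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the feature dimension, then scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Identity projections: q, k, v are used directly (no learned weights).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # block masked positions
    return softmax(scores, axis=-1) @ v

def decoder_block(X, enc_out,
                  gamma1, beta1, gamma2, beta2, gamma3, beta3,
                  W1, b1, W2, b2):
    tgt_len = X.shape[0]
    # Lower-triangular causal mask: position i sees positions <= i.
    causal = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))

    # Sublayer 1: pre-norm masked self-attention + residual.
    h = layer_norm(X, gamma1, beta1)
    X = X + attention(h, h, h, mask=causal)

    # Sublayer 2: pre-norm cross-attention over encoder output + residual.
    h = layer_norm(X, gamma2, beta2)
    X = X + attention(h, enc_out, enc_out)

    # Sublayer 3: pre-norm position-wise feed-forward (ReLU assumed) + residual.
    h = layer_norm(X, gamma3, beta3)
    return X + np.maximum(h @ W1 + b1, 0) @ W2 + b2
```

Because the self-attention mask is causal and the other sublayers act per position, perturbing a later target position leaves earlier output rows unchanged, which is a useful sanity check.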