Implement the SwiGLU activation from “GLU Variants Improve Transformer” (Shazeer, 2020).
SwiGLU is used in modern LLMs (PaLM, LLaMA). It splits the input into two halves and applies a gated activation:
$$\text{SwiGLU}(x, W, V, b, c) = \text{Swish}(xW + b) \otimes (xV + c)$$
where $\text{Swish}(z) = z \cdot \sigma(z)$ and $\sigma$ is the logistic sigmoid.
For simplicity, implement the core operation given pre-computed linear projections:
- gate: shape (batch, d), the xW + b projection
- value: shape (batch, d), the xV + c projection

$$\text{SwiGLU}(\text{gate}, \text{value}) = (\text{gate} \cdot \sigma(\text{gate})) \otimes \text{value}$$
Output: Tensor of shape (batch, d).
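A minimal NumPy sketch of this core operation, assuming the two projections are already computed (the function name `swiglu` is illustrative, not from the source):

```python
import numpy as np

def swiglu(gate: np.ndarray, value: np.ndarray) -> np.ndarray:
    """SwiGLU on pre-computed projections: Swish(gate) elementwise-times value.

    gate, value: arrays of shape (batch, d); returns an array of shape (batch, d).
    """
    # Swish(z) = z * sigmoid(z)
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))
    return swish * value
```

Sanity checks: when gate is 0, Swish(0) = 0 so the output is 0 regardless of value; for large positive gate, sigmoid approaches 1 and the output approaches gate * value. A production version would typically use a numerically stable sigmoid (or a framework primitive such as a built-in SiLU) rather than the naive `np.exp` form above.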