Implement the SwiGLU activation from “GLU Variants Improve Transformer” (Shazeer, 2020).
SwiGLU is used in modern LLMs (PaLM, LLaMA). It splits the input into two halves and applies a gated activation:
$$\text{SwiGLU}(x, W, V, b, c) = \text{Swish}(xW + b) \otimes (xV + c)$$
where $\text{Swish}(z) = z \cdot \sigma(z)$ and $\sigma$ is the logistic sigmoid.
For simplicity, implement the core operation given pre-computed linear projections:
- gate: shape (batch, d), the xW + b projection
- value: shape (batch, d), the xV + c projection

$$\text{SwiGLU}(\text{gate}, \text{value}) = (\text{gate} \cdot \sigma(\text{gate})) \otimes \text{value}$$
Output: Tensor of shape (batch, d).
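A minimal NumPy sketch of this core operation, assuming the two projections are already computed (the function name `swiglu` is illustrative, not from the source):

```python
import numpy as np

def swiglu(gate: np.ndarray, value: np.ndarray) -> np.ndarray:
    """SwiGLU on pre-computed projections: Swish(gate) elementwise-times value.

    gate, value: arrays of shape (batch, d); returns an array of shape (batch, d).
    """
    # Swish(z) = z * sigmoid(z)
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))
    return swish * value
```

Sanity checks: when gate is 0, Swish(0) = 0 so the output is 0 regardless of value; for large positive gate, sigmoid approaches 1 and the output approaches gate * value. A production version would typically use a numerically stable sigmoid (or a framework primitive such as a built-in SiLU) rather than the naive `np.exp` form above.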