Difficulty: medium · Category: research

SwiGLU Activation

Implement the SwiGLU activation from “GLU Variants Improve Transformer” (Shazeer, 2020).

SwiGLU is used in modern LLMs (PaLM, LLaMA). It computes two linear projections of the input (in practice often one larger projection split into two halves) and combines them with a gated activation:

$$\text{SwiGLU}(x, W, V, b, c) = \text{Swish}(xW + b) \otimes (xV + c)$$

where $\text{Swish}(z) = z \cdot \sigma(z)$ and $\otimes$ denotes element-wise multiplication.

For simplicity, implement the core operation given pre-computed linear projections:

  • gate: shape (batch, d) — the xW+b projection
  • value: shape (batch, d) — the xV+c projection

$$\text{SwiGLU}(\text{gate}, \text{value}) = (\text{gate} \cdot \sigma(\text{gate})) \otimes \text{value}$$

Output: Tensor of shape (batch, d).
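The core operation above can be sketched in NumPy as follows (the function name and use of NumPy are illustrative choices, not part of the problem specification):

```python
import numpy as np

def swiglu(gate: np.ndarray, value: np.ndarray) -> np.ndarray:
    """SwiGLU(gate, value) = Swish(gate) * value, element-wise.

    gate:  shape (batch, d), the precomputed xW + b projection.
    value: shape (batch, d), the precomputed xV + c projection.
    Returns an array of shape (batch, d).
    """
    # Swish(z) = z * sigmoid(z); clipping is omitted for clarity,
    # though np.exp can overflow for very negative inputs.
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))
    return swish * value

# Example: Swish(0) = 0, so a zero gate blocks the value entirely.
out = swiglu(np.zeros((2, 3)), np.ones((2, 3)))
```

Note that when the gate is zero the output is zero regardless of the value, which is the gating behaviour that distinguishes GLU variants from plain activations.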
