GLU Activation

Implement the Gated Linear Unit (GLU) activation function introduced by Dauphin et al. (2016) in “Language Modeling with Gated Convolutional Networks.”

The GLU takes an input $x$ and two weight matrices, $W$ (the gate projection, passed as w) and $V$ (the value projection, passed as v):

$$\text{GLU}(x, W, V) = \sigma(xW) \otimes (xV)$$

where $\sigma$ is the sigmoid function and $\otimes$ denotes element-wise multiplication.

Intuitively, the sigmoid branch acts as a soft gate: for each output dimension, it learns how much of the corresponding linear projection to let through. Gate values near 1 pass the projection through almost unchanged; values near 0 suppress it.
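To make the gating behaviour concrete, here is a small NumPy sketch. The pre-activation values are made up for illustration and stand in for x @ w and x @ v; they are not taken from the original paper.

```python
import numpy as np

def sigmoid(z):
    # Logistic function; adequate for the moderate values used here.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activations for three output dimensions.
gate_logits = np.array([-10.0, 0.0, 10.0])  # stands in for x @ w
values = np.array([2.0, 2.0, 2.0])          # stands in for x @ v

gate = sigmoid(gate_logits)  # roughly [0, 0.5, 1]
gated = gate * values        # roughly [0, 1, 2]
print(gate, gated)
```

The first dimension is almost fully suppressed, the second is halved, and the third passes through nearly unchanged.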

GLU is the precursor to SwiGLU (used in LLaMA and PaLM), which replaces the sigmoid with the Swish activation, $\text{Swish}(z) = z\,\sigma(z)$: $\text{SwiGLU}(x, W, V) = \text{Swish}(xW) \otimes (xV)$.
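For comparison, a minimal NumPy sketch of the SwiGLU variant follows. It assumes Swish with its scaling parameter fixed at 1 (often called SiLU), so Swish(z) = z * sigmoid(z); the function names swish and swiglu are illustrative and not part of this exercise's required interface.

```python
import numpy as np

def sigmoid(z):
    # tanh form of the logistic function, which avoids overflow in exp.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def swish(z):
    # Swish with beta = 1 (an assumption of this sketch).
    return z * sigmoid(z)

def swiglu(x, w, v):
    # Same structure as GLU, with Swish in place of the sigmoid gate.
    return swish(x @ w) * (x @ v)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
w = rng.normal(size=(4, 3))
v = rng.normal(size=(4, 3))
out = swiglu(x, w, v)
print(out.shape)  # (2, 3)
```

Unlike the sigmoid gate, the Swish branch is not bounded to the interval (0, 1), so it scales the value projection rather than only attenuating it.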

Input:

x: tensor of shape (..., d_in)
w: weight matrix of shape (d_in, d_out) — the gate projection
v: weight matrix of shape (d_in, d_out) — the value projection

Output: Tensor of shape (..., d_out) — sigmoid(x @ w) * (x @ v)
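
One possible way to meet this specification is sketched below in NumPy. The function name glu and the tanh-based sigmoid are choices made for this sketch, not requirements of the problem; a framework tensor library would work the same way.

```python
import numpy as np

def sigmoid(z):
    # tanh form of the logistic function; avoids overflow in exp for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def glu(x, w, v):
    # x: (..., d_in); w, v: (d_in, d_out). Matmul broadcasts over leading dims.
    x = np.asarray(x, dtype=float)
    gate = sigmoid(x @ w)   # (..., d_out), values in [0, 1]
    value = x @ v           # (..., d_out)
    return gate * value     # element-wise product

x = np.ones((2, 3, 4))
w = np.zeros((4, 5))  # zero gate weights, so every gate value is 0.5
v = np.ones((4, 5))   # each value entry is the sum of four ones, i.e. 4
out = glu(x, w, v)
print(out.shape)  # (2, 3, 5)
print(out[0, 0])  # [2. 2. 2. 2. 2.]
```

With zero gate weights the gate sits at exactly 0.5, so the output is half of the value projection, which makes this a convenient sanity check.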

Hints