GLU Activation
Implement the Gated Linear Unit (GLU) activation function introduced by Dauphin et al. (2016) in “Language Modeling with Gated Convolutional Networks.”
The GLU takes an input $x$ and two weight matrices, $W$ (the gate projection, `w` in the input list below) and $V$ (the value projection, `v` in the input list below):
$$\text{GLU}(x, W, V) = \sigma(xW) \otimes (xV)$$
where $\sigma$ is the sigmoid function and $\otimes$ denotes element-wise multiplication.
Intuitively, the sigmoid branch acts as a soft gate: for each output dimension, it learns how much of the corresponding linear projection to let through. Values near 1 allow the full signal; values near 0 suppress it.
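To make the gating intuition concrete, here is a minimal sketch for a single output dimension, using only the standard library. The helper name `gated` and the sample numbers are illustrative assumptions, not part of the problem statement.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def gated(value: float, gate_logit: float) -> float:
    """Scale one linear projection `value` by the sigmoid of `gate_logit`.

    Hypothetical helper: `value` stands for one entry of xV and
    `gate_logit` for the matching entry of xW.
    """
    return sigmoid(gate_logit) * value


# Illustrative numbers only: the same value under three different gate logits.
for gate_logit in (-6.0, 0.0, 6.0):
    print(gate_logit, round(gated(2.0, gate_logit), 3))
```

A strongly negative gate logit drives the output toward 0, a logit of 0 passes exactly half of the value, and a strongly positive logit passes almost all of it.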
GLU is the precursor to SwiGLU (used in LLaMA and PaLM), which swaps the sigmoid for the Swish activation, $\text{Swish}(z) = z\,\sigma(z)$: $\text{SwiGLU}(x, W, V) = \text{Swish}(xW) \otimes (xV)$.
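For comparison, the SwiGLU variant can be sketched in NumPy as follows. This assumes Swish with its scale parameter fixed to 1, that is $z\,\sigma(z)$, and omits bias terms. The function names `swish` and `swiglu` are illustrative and not taken from any particular library.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    # tanh form of the logistic sigmoid; avoids overflow in exp for large |z|
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def swish(z: np.ndarray) -> np.ndarray:
    # Swish with scale 1 (an assumption of this sketch): z * sigmoid(z)
    return z * sigmoid(z)


def swiglu(x: np.ndarray, w: np.ndarray, v: np.ndarray) -> np.ndarray:
    # Same structure as GLU, with Swish in place of the sigmoid on the gate branch
    return swish(x @ w) * (x @ v)
```

The only change from GLU is the gate branch: the gate is no longer confined to the range 0 to 1, since Swish is unbounded above.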
Input:
- x: tensor of shape (..., d_in)
- w: weight matrix of shape (d_in, d_out), the gate projection
- v: weight matrix of shape (d_in, d_out), the value projection
Output: Tensor of shape (..., d_out) — sigmoid(x @ w) * (x @ v)
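One possible way to satisfy this specification is sketched below in NumPy. It is not the site's reference solution: the choice of NumPy, the function name `glu`, and the random test shapes are assumptions made for illustration. Other tensor libraries would follow the same pattern.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    # tanh form of the logistic sigmoid; avoids overflow in exp for large |z|
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def glu(x: np.ndarray, w: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Return sigmoid(x @ w) * (x @ v) for x of shape (..., d_in)."""
    gate = sigmoid(x @ w)  # shape (..., d_out), values between 0 and 1
    value = x @ v          # shape (..., d_out)
    return gate * value    # element-wise product


# Illustrative shapes: a batch of 2 sequences, 5 positions, d_in=8, d_out=4.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 8))
w = rng.standard_normal((8, 4))
v = rng.standard_normal((8, 4))
out = glu(x, w, v)
print(out.shape)  # (2, 5, 4)
```

The `@` operator applies the projection over the last axis of `x` and leaves any leading axes untouched, which is what the `(..., d_in)` to `(..., d_out)` shape contract requires.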