Adapter Layer
Implement a bottleneck adapter layer from Houlsby et al. 2019 (“Parameter-Efficient Transfer Learning for NLP”).
Adapter layers are inserted into pretrained transformers; Houlsby et al. place one after the attention sublayer and one after the FFN sublayer of each layer. The base model weights are frozen and only the small adapter parameters are trained, which makes fine-tuning highly parameter-efficient.
Algorithm
Given input x of shape (..., d):
$$h = x + W_\text{up} \cdot \text{ReLU}(W_\text{down} \cdot x)$$
In matrix notation with row vectors (PyTorch/JAX convention), where w_down and w_up are the transposes of W_down and W_up:
down = x @ w_down # (..., b) — bottleneck projection
act = relu(down) # (..., b) — non-linearity
up = act @ w_up # (..., d) — back-projection
return x + up # (..., d) — residual
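The four steps above can be wrapped into a function. What follows is a minimal NumPy sketch rather than a reference solution: the name `adapter_forward` and the sample sizes are assumptions chosen for illustration, biases are omitted as in the formula, and NumPy stands in for PyTorch or JAX.

```python
import numpy as np


def adapter_forward(x, w_down, w_up):
    """Bottleneck adapter: x + relu(x @ w_down) @ w_up (no bias terms)."""
    down = x @ w_down            # (..., b) bottleneck projection
    act = np.maximum(down, 0.0)  # (..., b) ReLU non-linearity
    up = act @ w_up              # (..., d) back-projection
    return x + up                # (..., d) residual connection


# Illustrative sizes (assumed): batch of 2, sequence of 3, d = 8, b = 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8))
w_down = rng.normal(size=(8, 2))
w_up = rng.normal(size=(2, 8))

h = adapter_forward(x, w_down, w_up)
print(h.shape)  # (2, 3, 8)
```

Because `@` treats every axis except the last as a batch axis here, the same function handles a single vector, a batch, or a batch of sequences.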
Why bottleneck?
The bottleneck dimension b << d is what makes adapters parameter-efficient:
each adapter adds only d × b + b × d = 2db weights (ignoring biases), compared
to d × d for a full square layer. With b = 64 and d = 1024, that is 131,072
vs 1,048,576 parameters (about 131K vs 1M), an 8x reduction.
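The arithmetic can be checked directly. This is a small sketch in plain Python; the helper names `adapter_params` and `full_layer_params` are hypothetical, and both counts ignore bias terms.

```python
def adapter_params(d, b):
    """Weights in one bias-free adapter: down (d x b) plus up (b x d)."""
    return d * b + b * d


def full_layer_params(d):
    """Weights in one bias-free dense d x d layer."""
    return d * d


d, b = 1024, 64
n_adapter = adapter_params(d, b)
n_full = full_layer_params(d)
print(n_adapter)           # 131072
print(n_full)              # 1048576
print(n_adapter / n_full)  # 0.125
```

The ratio simplifies to 2b / d, so halving the bottleneck halves the adapter's cost relative to a full layer.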
Residual connection
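The residual (x + up) is crucial: when w_up = 0 (or w_down = 0, since
relu(0) = 0), the adapter is exactly the identity transformation. This lets
adapters be inserted into a pretrained model without breaking it. Only one
of the two should be exactly zero in practice: with both at zero, the
gradients of both matrices vanish and the adapter cannot train. Houlsby
et al. initialize the projections near zero, so training starts from an
approximate identity.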
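The identity property can be checked numerically. The sketch below assumes a bias-free adapter and illustrative sizes, and the function name `adapter_forward` is hypothetical. It shows that zeroing either projection on its own already makes the adapter return its input unchanged.

```python
import numpy as np


def adapter_forward(x, w_down, w_up):
    """Bias-free bottleneck adapter with a residual connection."""
    return x + np.maximum(x @ w_down, 0.0) @ w_up


rng = np.random.default_rng(0)
d, b = 8, 2  # illustrative sizes (assumed)
x = rng.normal(size=(4, d))

# Case 1: zero up-projection, arbitrary down-projection.
out_zero_up = adapter_forward(x, rng.normal(size=(d, b)), np.zeros((b, d)))

# Case 2: zero down-projection, arbitrary up-projection (relu(0) = 0).
out_zero_down = adapter_forward(x, np.zeros((d, b)), rng.normal(size=(b, d)))

print(np.array_equal(out_zero_up, x))    # True
print(np.array_equal(out_zero_down, x))  # True
```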
Compare to LoRA
LoRA adds a low-rank update A @ B to a frozen weight matrix W, so the layer
computes x @ (W + A @ B) with no extra non-linearity, and the update can be
merged into W after training. Adapters compute relu(x @ w_down) @ w_up and
add the result to the activations directly. Both are parameter-efficient;
adapters add a non-linearity, while LoRA has a cleaner weight-space
interpretation.
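The difference can be seen on a tiny example. The sketch below uses hand-picked values (assumptions, chosen so the arithmetic is exact) and reuses the same two matrices as LoRA factors and as adapter projections purely for comparison. The frozen weight `W` is a hypothetical stand-in for a pretrained layer.

```python
import numpy as np

# Small hand-picked values (assumed for illustration) so results are exact.
x = np.array([[1.0, -2.0]])
W = np.array([[2.0, 0.0],
              [0.0, 1.0]])   # frozen pretrained weight
A = np.array([[1.0],
              [2.0]])        # (d, r) with r = 1
B = np.array([[1.0, 0.0]])   # (r, d)

# LoRA: the update is linear in x, so it folds into the weight matrix.
lora_separate = x @ W + x @ (A @ B)
lora_merged = x @ (W + A @ B)
print(np.array_equal(lora_separate, lora_merged))  # True

# Adapter: applied to the layer output h, with a ReLU inside the bottleneck.
h = x @ W
adapter_out = h + np.maximum(h @ A, 0.0) @ B
no_relu_out = h + (h @ A) @ B
print(np.array_equal(adapter_out, no_relu_out))    # False
```

Here the bottleneck pre-activation is negative, so the ReLU zeroes it and the adapter passes `h` through unchanged, whereas the purely linear version shifts it. This input-dependent gating is why an adapter cannot in general be folded into a single weight matrix the way a LoRA update can.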
Inputs / Output
- x: shape (..., d), input activations
- w_down: shape (d, b), down-projection (bottleneck)
- w_up: shape (b, d), up-projection
Output: shape (..., d), equal to x + relu(x @ w_down) @ w_up
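The shape contract above can be enforced before computing anything. This is one possible sketch, not part of the problem statement: the helper `check_adapter_shapes` and its error messages are hypothetical, and NumPy is assumed as the array library.

```python
import numpy as np


def check_adapter_shapes(x, w_down, w_up):
    """Raise ValueError unless x: (..., d), w_down: (d, b), w_up: (b, d)."""
    if w_down.ndim != 2 or w_up.ndim != 2:
        raise ValueError("w_down and w_up must be 2-D matrices")
    d, b = w_down.shape
    if x.ndim < 1 or x.shape[-1] != d:
        raise ValueError(f"last axis of x must be {d}, got shape {x.shape}")
    if w_up.shape != (b, d):
        raise ValueError(f"w_up must have shape {(b, d)}, got {w_up.shape}")


def adapter_forward(x, w_down, w_up):
    check_adapter_shapes(x, w_down, w_up)
    return x + np.maximum(x @ w_down, 0.0) @ w_up


d, b = 6, 2  # illustrative sizes (assumed)
w_down = np.ones((d, b))
w_up = np.ones((b, d))

# Any number of leading axes is accepted; the output shape matches x.
for shape in [(d,), (5, d), (2, 3, d)]:
    out = adapter_forward(np.ones(shape), w_down, w_up)
    print(shape, "->", out.shape)
```

Checking shapes up front turns a silent broadcasting surprise or a cryptic matmul error into a message that names the offending argument.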
Reference: Houlsby et al. 2019 — “Parameter-Efficient Transfer Learning for NLP”