Adapter Layer

Implement a bottleneck adapter layer from Houlsby et al. 2019 (“Parameter-Efficient Transfer Learning for NLP”).

Adapter layers are small modules inserted into a pretrained transformer (typically after each FFN sublayer; Houlsby et al. also add one after each attention sublayer). The base model weights are frozen and only the adapter parameters are trained, so fine-tuning updates a small fraction of the model's parameters.

Algorithm

Given input x of shape (..., d):

$$h = x + W_\text{up} \cdot \text{ReLU}(W_\text{down} \cdot x)$$

In matrix notation with row vectors (PyTorch/JAX convention):

down = x @ w_down       # (..., b)  — bottleneck projection
act  = relu(down)       # (..., b)  — non-linearity
up   = act @ w_up       # (..., d)  — back-projection
return x + up           # (..., d)  — residual
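The four steps above can be wrapped into a runnable function. This is a minimal NumPy sketch, not a reference implementation: the name `adapter_forward`, the toy sizes, and the 0.01 weight scale are assumptions made for illustration, and biases are left out to match the formula.

```python
import numpy as np

def adapter_forward(x, w_down, w_up):
    """Bottleneck adapter: x + relu(x @ w_down) @ w_up."""
    down = x @ w_down            # (..., b) bottleneck projection
    act = np.maximum(down, 0.0)  # (..., b) ReLU non-linearity
    up = act @ w_up              # (..., d) back-projection
    return x + up                # (..., d) residual

# Toy sizes, chosen only for the example.
rng = np.random.default_rng(0)
d, b = 8, 2
x = rng.normal(size=(4, 3, d))            # (batch, seq, d)
w_down = rng.normal(size=(d, b)) * 0.01   # (d, b)
w_up = rng.normal(size=(b, d)) * 0.01     # (b, d)

out = adapter_forward(x, w_down, w_up)
print(out.shape)  # (4, 3, 8)
```

Because `@` broadcasts over leading dimensions, the same function handles inputs of shape (d,), (n, d), or (batch, seq, d) without changes.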

Why bottleneck?

The bottleneck dimension b << d is what makes adapters parameter-efficient: each adapter adds only d × b + b × d = 2db parameters (ignoring biases), compared to d × d for a full square layer. With b = 64 and d = 1024, that is 131,072 vs 1,048,576 parameters, an 8× reduction.
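The arithmetic can be checked directly. The helper names below are hypothetical, and both counts ignore bias terms, as the formula above does.

```python
def adapter_params(d, b):
    """Weights in one bias-free adapter: w_down (d x b) plus w_up (b x d)."""
    return d * b + b * d

def dense_params(d):
    """Weights in one bias-free d x d linear layer."""
    return d * d

d, b = 1024, 64
print(adapter_params(d, b))                     # 131072
print(dense_params(d))                          # 1048576
print(dense_params(d) // adapter_params(d, b))  # 8
```

The ratio is d / (2b), so halving the bottleneck size halves the adapter's parameter count.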

Residual connection

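The residual (x + up) is crucial: when w_up = 0 (and in particular when w_down = w_up = 0), the adapter is the identity transformation. This lets adapters be inserted into a pretrained model without breaking it, because training starts from identity. In practice only one projection should be exactly zero (usually w_up): if both are zero, both gradients are zero as well and the adapter cannot start learning. Houlsby et al. instead initialize the adapter weights near zero, which gives a near-identity function.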
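The identity property, and what it implies for initialization, can be checked numerically. The sketch below is an illustration under stated assumptions: the toy loss `out.sum()`, the helper `adapter_grads`, and the constant test inputs are all made up for the example, and the gradients are derived by hand rather than by an autodiff library.

```python
import numpy as np

def adapter_grads(x, w_down, w_up):
    """Forward pass plus hand-derived gradients of the toy loss out.sum()."""
    down = x @ w_down
    act = np.maximum(down, 0.0)
    out = x + act @ w_up
    g = np.ones_like(out)                          # dloss/dout for the toy loss
    grad_up = act.T @ g                            # (b, d)
    grad_down = x.T @ ((g @ w_up.T) * (down > 0))  # (d, b), ReLU mask applied
    return out, grad_down, grad_up

n, d, b = 5, 8, 2
x = np.linspace(0.1, 1.0, n * d).reshape(n, d)     # all-positive toy input

# Case A: w_up = 0, w_down non-zero. Output is x, and w_up gets a gradient.
out_a, gd_a, gu_a = adapter_grads(x, np.full((d, b), 0.1), np.zeros((b, d)))
print(np.allclose(out_a, x), np.any(gu_a != 0))    # True True

# Case B: both zero. Output is still x, but both gradients vanish.
out_b, gd_b, gu_b = adapter_grads(x, np.zeros((d, b)), np.zeros((b, d)))
print(np.allclose(out_b, x), np.any(gu_b != 0), np.any(gd_b != 0))  # True False False
```

Case B shows why an exactly-zero initialization of both matrices is a poor starting point under plain gradient descent: the adapter is the identity, but it has no gradient signal to move away from it.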

Compare to LoRA

LoRA adds a low-rank update A @ B to a frozen weight matrix W, so the layer computes x @ (W + A @ B), with no non-linearity in the update. Adapters instead compute relu(x @ w_down) @ w_up and add the result to the activations directly. Both are parameter-efficient. Adapters add a non-linearity; LoRA has a cleaner weight-space interpretation, since the update can be merged into W after training.
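The difference between the two can be sketched side by side. This is a simplified comparison under assumptions: `w` stands in for a hypothetical frozen weight, all sizes are toy values, and LoRA's usual scaling factor is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 4, 8, 2
x = rng.normal(size=(n, d))
w = rng.normal(size=(d, d))        # hypothetical frozen pretrained weight

# LoRA: low-rank update in weight space (scaling factor omitted).
a = rng.normal(size=(d, r))
b_mat = rng.normal(size=(r, d))
lora_merged = x @ (w + a @ b_mat)        # update folded into the weight
lora_split = x @ w + (x @ a) @ b_mat     # same result, kept as a side branch
print(np.allclose(lora_merged, lora_split))  # True

# Adapter: acts on the layer's output activations h, with a ReLU in between.
w_down = rng.normal(size=(d, r))
w_up = rng.normal(size=(r, d))
h = x @ w
adapter_out = h + np.maximum(h @ w_down, 0.0) @ w_up
print(adapter_out.shape)  # (4, 8)
```

The LoRA branch is linear in x, which is why it folds into `w`. The ReLU in the adapter branch prevents the same trick, so the adapter stays in the model as a separate module.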

Inputs / Output

x: shape (..., d) — input activations
w_down: shape (d, b) — down-projection (bottleneck)
w_up: shape (b, d) — up-projection

Output: shape (..., d) — x + relu(x @ w_down) @ w_up

Reference: Houlsby et al. 2019 — “Parameter-Efficient Transfer Learning for NLP”
