Mixture of Experts Routing

Implement Mixture of Experts (MoE) routing from “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (Shazeer et al., 2017).

Given an input token, route it through the top-k experts selected by the gating network and combine their outputs.

Given:

  • x: shape (d,) — input token embedding
  • gate_weights: shape (d, n_experts) — gating network weights
  • expert_weights: list of n_experts weight matrices, each shape (d, d) — simple linear experts
  • top_k: integer — number of experts to activate

Steps:

  1. Compute gate logits: $g = x \cdot W_{gate}$, shape (n_experts,)
  2. Select top-k expert indices
  3. Softmax over only the top-k logits to get routing weights
  4. For each selected expert i: $e_i = x \cdot W_i$
  5. Output = weighted sum: $\sum_{i \in \text{top-k}} w_i \cdot e_i$

Output: Tensor of shape (d,).
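The five steps above can be sketched directly in NumPy. This is a minimal single-token reference implementation, not an optimized batched kernel; the function name `moe_forward` and the use of `argpartition` for top-k selection are my choices, not prescribed by the task.

```python
import numpy as np

def moe_forward(x, gate_weights, expert_weights, top_k):
    """Route one token through its top-k experts (Shazeer et al., 2017).

    x: (d,), gate_weights: (d, n_experts),
    expert_weights: list of n_experts matrices of shape (d, d).
    Returns a tensor of shape (d,).
    """
    # Step 1: gate logits, (d,) @ (d, n_experts) -> (n_experts,)
    gate_logits = x @ gate_weights

    # Step 2: indices of the top-k logits (order among them is irrelevant)
    top_idx = np.argpartition(gate_logits, -top_k)[-top_k:]

    # Step 3: softmax over only the selected logits
    # (max-subtraction for numerical stability)
    top_logits = gate_logits[top_idx]
    exp = np.exp(top_logits - top_logits.max())
    routing_weights = exp / exp.sum()

    # Steps 4-5: weighted sum of the selected experts' outputs
    out = np.zeros_like(x)
    for w, i in zip(routing_weights, top_idx):
        out += w * (x @ expert_weights[i])
    return out
```

A useful sanity check: with `top_k == n_experts` the routing softmax runs over all logits, so the output must match a fully dense softmax-weighted mixture of all experts.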
