Mixture of Experts Routing

Implement Mixture of Experts (MoE) routing from “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (Shazeer et al., 2017).

Given an input token, route it through the top-k experts selected by the gating network and combine their outputs.

Given:

  • x: shape (d,) — input token embedding
  • gate_weights: shape (d, n_experts) — gating network weights
  • expert_weights: list of n_experts weight matrices, each shape (d, d) — simple linear experts
  • top_k: integer — number of experts to activate

Steps:

  1. Compute gate logits: $g = x \cdot W_{gate}$, shape (n_experts,)
  2. Select top-k expert indices
  3. Softmax over only the top-k logits to get routing weights
  4. For each selected expert i: $e_i = x \cdot W_i$
  5. Output = weighted sum: $\sum_{i \in \text{top-k}} w_i \cdot e_i$

Output: Tensor of shape (d,).
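The five steps above can be sketched directly in NumPy. This is a minimal single-token reference implementation, not an optimized batched kernel; the function name `moe_forward` and the use of `argpartition` for top-k selection are my choices, not prescribed by the task.

```python
import numpy as np

def moe_forward(x, gate_weights, expert_weights, top_k):
    """Route one token through its top-k experts (Shazeer et al., 2017).

    x: (d,), gate_weights: (d, n_experts),
    expert_weights: list of n_experts matrices of shape (d, d).
    Returns a tensor of shape (d,).
    """
    # Step 1: gate logits, (d,) @ (d, n_experts) -> (n_experts,)
    gate_logits = x @ gate_weights

    # Step 2: indices of the top-k logits (order among them is irrelevant)
    top_idx = np.argpartition(gate_logits, -top_k)[-top_k:]

    # Step 3: softmax over only the selected logits
    # (max-subtraction for numerical stability)
    top_logits = gate_logits[top_idx]
    exp = np.exp(top_logits - top_logits.max())
    routing_weights = exp / exp.sum()

    # Steps 4-5: weighted sum of the selected experts' outputs
    out = np.zeros_like(x)
    for w, i in zip(routing_weights, top_idx):
        out += w * (x @ expert_weights[i])
    return out
```

A useful sanity check: with `top_k == n_experts` the routing softmax runs over all logits, so the output must match a fully dense softmax-weighted mixture of all experts.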
