hard
research
Mixture of Experts Routing
Implement Mixture of Experts (MoE) routing from “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (Shazeer et al., 2017).
Given an input token, route it through the top-k experts and combine their outputs.
Given:
- x: shape (d,), the input token embedding
- gate_weights: shape (d, n_experts), the gating network weights
- expert_weights: list of n_experts weight matrices, each of shape (d, d), acting as simple linear experts
- top_k: integer, the number of experts to activate
Steps:
- Compute gate logits: $g = x \cdot W_{gate}$, shape (n_experts,)
- Select the top-k expert indices
- Apply softmax over only the top-k logits to get routing weights
- For each selected expert i: $e_i = x \cdot W_i$
- Output = weighted sum: $\sum_{i \in \text{top-k}} w_i \cdot e_i$
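Written out, the softmax step normalizes over the selected logits only, so the routing weights of the chosen experts sum to 1 and every unselected expert contributes nothing. Using the same symbols as the steps above:

$$w_i = \frac{\exp(g_i)}{\sum_{j \in \text{top-k}} \exp(g_j)}, \qquad i \in \text{top-k}$$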
Output: Tensor of shape (d,).
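The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the reference solution: the function name `moe_route` and the helper `vec_mat` are hypothetical, inputs are nested lists rather than tensors, and a practical version would more likely use a tensor library.

```python
import math


def moe_route(x, gate_weights, expert_weights, top_k):
    """Route one token through its top_k experts (illustrative sketch).

    x: list of d floats
    gate_weights: nested list of shape (d, n_experts)
    expert_weights: list of n_experts nested lists, each of shape (d, d)
    """
    d = len(x)
    n_experts = len(expert_weights)

    def vec_mat(v, m):
        # Row vector times matrix: result[j] = sum_i v[i] * m[i][j]
        return [sum(v[i] * m[i][j] for i in range(len(v)))
                for j in range(len(m[0]))]

    # Step 1: gate logits g = x . W_gate, shape (n_experts,)
    logits = vec_mat(x, gate_weights)

    # Step 2: indices of the top_k largest logits
    top = sorted(range(n_experts), key=lambda i: logits[i], reverse=True)[:top_k]

    # Step 3: softmax over the selected logits only
    # (subtracting the max is a standard numerical-stability trick)
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Steps 4-5: run each selected expert and accumulate the weighted sum
    out = [0.0] * d
    for w, i in zip(weights, top):
        expert_out = vec_mat(x, expert_weights[i])
        out = [o + w * v for o, v in zip(out, expert_out)]
    return out
```

Only the selected experts are evaluated, which is where the sparsity saving comes from: the cost per token scales with top_k rather than with n_experts.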
mixture-of-experts
moe
shazeer-2017
sparse
routing