Implement Mixture of Experts (MoE) routing from “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (Shazeer et al., 2017).
Given an input token, route it through the top-k experts and combine their outputs.
Given:
- x: shape (d,) — input token embedding
- gate_weights: shape (d, n_experts) — gating network weights
- expert_weights: list of n_experts weight matrices, each shape (d, d) — simple linear experts
- top_k: integer — number of experts to activate

Steps:
1. Compute the gating logits x @ gate_weights, shape (n_experts,).
2. Select the indices of the top_k largest logits.
3. Apply softmax over only those top_k logits to get the routing weights (equivalent to setting all other logits to -inf before a full softmax).
4. Compute each selected expert's output x @ expert_weights[i], shape (d,).
5. Combine the selected experts' outputs as a weighted sum using the routing weights.
Output: Tensor of shape (d,).
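The steps above can be sketched as follows. This is one possible reference solution using NumPy; the function name `moe_forward` is a placeholder, and the experts are applied densely in a loop for clarity rather than batched.

```python
import numpy as np

def moe_forward(x, gate_weights, expert_weights, top_k):
    """Top-k sparsely-gated MoE routing (Shazeer et al., 2017) for one token."""
    logits = x @ gate_weights                  # gating logits, shape (n_experts,)
    top_idx = np.argsort(logits)[-top_k:]      # indices of the top_k experts
    top_logits = logits[top_idx]
    # Softmax restricted to the selected experts (same as masking the
    # non-selected logits to -inf before a full softmax).
    weights = np.exp(top_logits - top_logits.max())
    weights /= weights.sum()
    # Weighted sum of the activated experts' outputs.
    out = np.zeros_like(x)
    for w, i in zip(weights, top_idx):
        out += w * (x @ expert_weights[i])     # each expert maps (d,) -> (d,)
    return out
```

When top_k equals n_experts this reduces to an ordinary (dense) softmax-weighted mixture, which is a useful sanity check.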