
Top-K Gating

Implement Top-K Gating for Mixture of Experts layers.

Given gate logits for a batch of tokens, compute sparse routing weights by keeping only each token's top-k logits, masking the rest to -infinity, and then applying softmax — so non-selected experts receive (near-)zero weight.

Given:

  • logits: shape (batch, n_experts) — raw gate logits
  • k: integer — number of experts to keep per token

Steps:

  1. For each token, find the top-k logit values
  2. Create a mask that is True for the top-k positions
  3. Set non-top-k positions to -infinity
  4. Apply softmax over all experts (non-top-k become ~0)

Output: Tensor of shape (batch, n_experts) — sparse routing weights (summing to 1 per row).
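The four steps above can be sketched in NumPy (a framework-agnostic illustration; the function name `top_k_gating` is chosen for this example, and the same logic maps directly onto PyTorch's `torch.topk` and `torch.softmax`):

```python
import numpy as np

def top_k_gating(logits: np.ndarray, k: int) -> np.ndarray:
    """Sparse routing weights of shape (batch, n_experts), summing to 1 per row."""
    # Step 1-2: indices of the top-k logits per token (order within the k is irrelevant)
    topk_idx = np.argpartition(logits, -k, axis=-1)[:, -k:]

    # Step 3: start from -inf everywhere, copy back only the top-k logits
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, topk_idx,
                      np.take_along_axis(logits, topk_idx, axis=-1), axis=-1)

    # Step 4: numerically stable softmax over all experts;
    # exp(-inf) = 0, so non-top-k positions get exactly zero weight
    z = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


weights = top_k_gating(np.array([[1.0, 2.0, 3.0, 4.0]]), k=2)
```

Because the masked positions are exactly `-inf` before the softmax, their weight is exactly zero rather than merely small, and the surviving k weights renormalize to sum to 1.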
