medium
research
Top-K Gating
Implement Top-K Gating for Mixture of Experts layers.
Given gate logits for a batch of tokens, compute the sparse routing weights: keep only the top-k logits per token, mask the rest to -infinity, and apply softmax so that the masked experts receive zero weight.
Given:
- logits: shape (batch, n_experts), raw gate logits
- k: integer, number of experts to keep per token
Steps:
- For each token, find the top-k logit values
- Create a mask that is True for the top-k positions
- Set non-top-k positions to -infinity
- Apply softmax over all experts (non-top-k positions become 0, since exp(-infinity) = 0)
Output: Tensor of shape (batch, n_experts) containing the sparse routing weights (each row sums to 1).
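The steps above can be sketched in plain Python, with nested lists standing in for tensors so the example runs without any tensor library. This is a minimal reference sketch, not the problem's official solution. The function name `top_k_gating`, breaking ties in favor of the lower expert index, and the max-subtraction for numerical stability are all assumptions that the problem statement does not specify.

```python
import math

def top_k_gating(logits, k):
    """Return sparse routing weights for a batch of gate logits.

    logits: list of rows, each a list of n_experts floats.
    k: number of experts to keep per token (1 <= k <= n_experts).
    """
    weights = []
    for row in logits:
        if not 1 <= k <= len(row):
            raise ValueError("k must be between 1 and n_experts")
        # Step 1: indices of the k largest logits.
        # Assumption: ties are broken in favor of the lower index.
        top = sorted(range(len(row)), key=lambda i: (-row[i], i))[:k]
        # Step 2: membership in this set plays the role of the boolean mask.
        keep = set(top)
        # Step 3: non-top-k positions are set to -infinity.
        masked = [x if i in keep else -math.inf for i, x in enumerate(row)]
        # Step 4: softmax over all experts. Subtracting the row maximum
        # avoids overflow, and exp(-inf) evaluates to exactly 0.0.
        m = max(masked)
        exps = [math.exp(x - m) for x in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

if __name__ == "__main__":
    # One token, four experts, keep the two largest logits (experts 3 and 0).
    print(top_k_gating([[2.0, 1.0, 0.5, 3.0]], k=2))
```

For the row `[2.0, 1.0, 0.5, 3.0]` with `k=2`, only experts 0 and 3 receive non-zero weight, roughly 0.269 and 0.731. In a tensor library the same four steps would typically map to a top-k call, a boolean mask, a masked fill with -infinity, and a softmax along the expert dimension.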
top-k-gating
moe
sparse-routing
gating