
Top-K Gating

Implement Top-K Gating for Mixture of Experts layers.

Given gate logits for a batch of tokens, compute sparse routing weights by keeping only each token's top-k logits, masking the rest to -infinity, and then applying softmax — so non-selected experts receive (near-)zero weight.

Given:

  • logits: shape (batch, n_experts) — raw gate logits
  • k: integer — number of experts to keep per token

Steps:

  1. For each token, find the top-k logit values
  2. Create a mask that is True for the top-k positions
  3. Set non-top-k positions to -infinity
  4. Apply softmax over all experts (non-top-k become ~0)

Output: Tensor of shape (batch, n_experts) — sparse routing weights (summing to 1 per row).
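The four steps above can be sketched in NumPy (a framework-agnostic illustration; the function name `top_k_gating` is chosen for this example, and the same logic maps directly onto PyTorch's `torch.topk` and `torch.softmax`):

```python
import numpy as np

def top_k_gating(logits: np.ndarray, k: int) -> np.ndarray:
    """Sparse routing weights of shape (batch, n_experts), summing to 1 per row."""
    # Step 1-2: indices of the top-k logits per token (order within the k is irrelevant)
    topk_idx = np.argpartition(logits, -k, axis=-1)[:, -k:]

    # Step 3: start from -inf everywhere, copy back only the top-k logits
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, topk_idx,
                      np.take_along_axis(logits, topk_idx, axis=-1), axis=-1)

    # Step 4: numerically stable softmax over all experts;
    # exp(-inf) = 0, so non-top-k positions get exactly zero weight
    z = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


weights = top_k_gating(np.array([[1.0, 2.0, 3.0, 4.0]]), k=2)
```

Because the masked positions are exactly `-inf` before the softmax, their weight is exactly zero rather than merely small, and the surviving k weights renormalize to sum to 1.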
