Implement Top-K Gating for Mixture of Experts layers.
Given gate logits for a batch of tokens, compute sparse routing weights: keep only the top-k logits per token, mask the rest to -inf, and apply softmax. Masking before the softmax ensures the excluded experts receive exactly zero weight while the kept weights still sum to 1.
Given:
- logits: shape (batch, n_experts) — raw gate logits
- k: integer — number of experts to keep per token

Steps:
1. For each row, find the indices of the k largest logits.
2. Set all other logits to -inf.
3. Apply a softmax over the masked logits.
Output: Tensor of shape (batch, n_experts) — sparse routing weights (summing to 1 per row).
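A minimal NumPy sketch of the routine described above (the function name `top_k_gating` is an assumption, not specified by the task):

```python
import numpy as np

def top_k_gating(logits: np.ndarray, k: int) -> np.ndarray:
    """Sparse top-k routing weights: softmax over the top-k logits
    per row, exact zeros elsewhere.

    logits: (batch, n_experts) raw gate logits
    k:      number of experts to keep per token
    """
    # Indices of the k largest logits in each row (unordered is fine).
    topk_idx = np.argpartition(logits, -k, axis=1)[:, -k:]

    # Start from -inf and copy back only the top-k logits,
    # so the softmax assigns zero weight to masked experts.
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(
        masked, topk_idx,
        np.take_along_axis(logits, topk_idx, axis=1),
        axis=1,
    )

    # Numerically stable softmax; exp(-inf) evaluates to 0.
    masked = masked - masked.max(axis=1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=1, keepdims=True)
```

Each output row has exactly k nonzero entries and sums to 1, matching the required output shape (batch, n_experts).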