We can't find the internet
Attempting to reconnect
Something went wrong!
Attempting to reconnect
hard
research
KV Cache for Autoregressive Decoding
Implement KV cache management for autoregressive transformer decoding.
During autoregressive generation, each new token only needs to attend to all previous tokens. The KV cache stores previously computed key and value projections to avoid recomputation.
Given:
-
cached_K: shape(cache_len, d_k)โ previously cached keys -
cached_V: shape(cache_len, d_k)โ previously cached values -
new_q: shape(1, d_k)โ query for the new token -
new_k: shape(1, d_k)โ key for the new token -
new_v: shape(1, d_k)โ value for the new token
Steps:
- Append new_k to cached_K, and new_v to cached_V
- Compute attention: output = softmax(new_q @ full_K^T / sqrt(d_k)) @ full_V
Output: A dict with:
-
"output": shape(1, d_k)โ the attention output for the new token -
"updated_K": shape(cache_len+1, d_k)โ updated key cache -
"updated_V": shape(cache_len+1, d_k)โ updated value cache
Hints
kv-cache
autoregressive
decoding
inference
transformer
Sign in to attempt this problem and view the solution.