Implement Sliding Window Attention from “Longformer: The Long-Document Transformer” (Beltagy et al., 2020).
Instead of attending to all positions, each query attends only to a local window
of w positions on each side (total window size = 2w+1).
Given:
- Q: shape (seq_len, d_k)
- K: shape (seq_len, d_k)
- V: shape (seq_len, d_k)
- window_size: integer w — attend to positions [i-w, i+w]

For position i, attend only to keys in the range [max(0, i-w), min(seq_len-1, i+w)]. Positions outside the window are set to -infinity before the softmax.
Output: Tensor of shape (seq_len, d_k).
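A minimal reference sketch of this spec in NumPy (the function name `sliding_window_attention` is an assumption, not from the original). It materializes the full (seq_len, seq_len) score matrix and applies a band mask, which is O(n²); Longformer's actual implementation avoids this with a banded computation, but the masked version is a correct reference for checking outputs.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window_size):
    """Sliding window attention: each query i attends to keys j with |i - j| <= w.

    Q, K, V: arrays of shape (seq_len, d_k); window_size: the half-window w.
    Returns an array of shape (seq_len, d_k).
    """
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len) scaled dot-product scores

    # Band mask: True where position i is allowed to attend to position j.
    idx = np.arange(seq_len)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= window_size
    scores = np.where(allowed, scores, -np.inf)  # -inf outside the window

    # Numerically stable row-wise softmax; -inf entries become weight 0.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V  # (seq_len, d_k)
```

As a sanity check, setting w >= seq_len - 1 makes the band cover every position, so the result should match full softmax attention.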