Implement Sliding Window Attention from “Longformer: The Long-Document Transformer” (Beltagy et al., 2020).
Instead of attending to all positions, each query attends only to a local window
of w positions on each side (total window size = 2w+1).
Given:
- Q: shape (seq_len, d_k)
- K: shape (seq_len, d_k)
- V: shape (seq_len, d_k)
- window_size: integer w — attend to positions [i-w, i+w]

For position i, attend only to keys in the range [max(0, i-w), min(seq_len-1, i+w)]. Positions outside the window are set to -infinity before the softmax.
Output: Tensor of shape (seq_len, d_k).
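A minimal reference sketch of this spec in NumPy (the function name `sliding_window_attention` is an assumption, not from the original). It materializes the full (seq_len, seq_len) score matrix and applies a band mask, which is O(n²); Longformer's actual implementation avoids this with a banded computation, but the masked version is a correct reference for checking outputs.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window_size):
    """Sliding window attention: each query i attends to keys j with |i - j| <= w.

    Q, K, V: arrays of shape (seq_len, d_k); window_size: the half-window w.
    Returns an array of shape (seq_len, d_k).
    """
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len) scaled dot-product scores

    # Band mask: True where position i is allowed to attend to position j.
    idx = np.arange(seq_len)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= window_size
    scores = np.where(allowed, scores, -np.inf)  # -inf outside the window

    # Numerically stable row-wise softmax; -inf entries become weight 0.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V  # (seq_len, d_k)
```

As a sanity check, setting w >= seq_len - 1 makes the band cover every position, so the result should match full softmax attention.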