
Sliding Window Attention

Implement Sliding Window Attention from “Longformer: The Long-Document Transformer” (Beltagy et al., 2020).

Instead of attending to all positions, each query only attends to a local window of w positions on each side (total window size = 2w+1).

Given:

  • Q: shape (seq_len, d_k)
  • K: shape (seq_len, d_k)
  • V: shape (seq_len, d_k)
  • window_size: integer w — attend to positions [i-w, i+w]

For position i, attend only to keys in the range [max(0, i-w), min(seq_len-1, i+w)]. Positions outside the window receive a score of -infinity before the softmax, so they contribute zero attention weight.

Output: Tensor of shape (seq_len, d_k).
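The spec above can be sketched as follows. This is a minimal NumPy reference version (function and variable names are illustrative, not prescribed by the problem): it computes the full score matrix, masks entries outside the band |i - j| > w with -infinity, then applies a row-wise softmax. Note that building the full (seq_len, seq_len) matrix is O(n²) in memory; Longformer's actual efficiency gain comes from banded kernels that only materialize the O(n·w) in-window scores.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window_size):
    """Sliding-window attention (reference sketch, not the efficient banded kernel).

    Q, K, V: arrays of shape (seq_len, d_k).
    window_size: w -- query i attends to keys j with |i - j| <= w.
    Returns an array of shape (seq_len, d_k).
    """
    seq_len, d_k = Q.shape
    # Scaled dot-product scores: (seq_len, seq_len).
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask positions outside the local window with -inf before softmax.
    idx = np.arange(seq_len)
    outside = np.abs(idx[:, None] - idx[None, :]) > window_size
    scores = np.where(outside, -np.inf, scores)
    # Numerically stable row-wise softmax (the diagonal is always
    # in-window, so every row has at least one finite score).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With window_size >= seq_len - 1 the mask is empty and this reduces to standard full softmax attention, which is a handy sanity check when testing.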
