medium primitives

BPE Merge Step

Implement a single merge step of Byte-Pair Encoding (BPE).

BPE was originally a text-compression algorithm (Gage 1994) that was adapted for subword tokenization by Sennrich et al. in “Neural Machine Translation of Rare Words with Subword Units” (ACL 2016). It is now the standard tokenizer used by GPT-2, GPT-4, LLaMA, and most other large language models.

How BPE tokenization works:

  1. Start with a vocabulary of individual characters (or bytes).
  2. Count all adjacent symbol pairs in the training corpus.
  3. Merge the most frequent pair everywhere it appears, creating a new symbol.
  4. Repeat until the vocabulary reaches its target size.

During encoding, the learned merge rules are applied in order to a fresh character sequence — each rule is a single merge step like this one. After all rules have been applied, the remaining tokens are the final subword tokenization.

Your task: implement one merge step.

Given a sequence of token strings and a bigram pair, return a new sequence where every consecutive (non-overlapping, left-to-right) occurrence of pair has been replaced by the concatenation of its two elements.

Input:

  • tokens: list of token strings, e.g. ["l", "o", "w", "e", "r"]
  • pair: 2-element list giving the bigram to merge, e.g. ["l", "o"]

Output: new list of strings after the merge, e.g. ["lo", "w", "e", "r"]

Edge cases:

  • Merges are greedy left-to-right and non-overlapping: ["a", "a", "a"] merged with ["a", "a"]["aa", "a"], not ["a", "aa"].
  • If pair does not appear, return a copy of tokens unchanged.
  • An empty tokens list returns [].

This single step composes into a full BPE encoder (bpe-encode-text): apply each merge rule from the learned merge table in priority order to turn any string into its subword token sequence.

Reference: Sennrich, Haddow & Birch, “Neural Machine Translation of Rare Words with Subword Units”, ACL 2016.

Hints

tokenization bpe

Sign in to attempt this problem and view the solution.