hard end_to_end

BPE Encode Text

Implement BPE text encoding — the final encoding step used by GPT-2, GPT-3, BLOOM, and most modern large language models.

What is BPE encoding?

Byte-Pair Encoding (Sennrich et al., ACL 2016) is a subword tokenization algorithm. Training produces two artefacts:

  1. A vocabulary mapping token strings to integer ids.
  2. An ordered list of merge rules, each being a pair of token strings.

To encode a new string, you apply the learned merges in priority order to a character-level representation of the text. Each merge rule says “wherever you see these two adjacent tokens, replace them with their concatenation.” After all rules are applied the remaining tokens are looked up in the vocabulary to produce the final token id sequence.

Algorithm:

  1. Split text into individual characters: tokens = list(text).
  2. For each merge rule [a, b] in merges (in order):
    • Walk tokens left-to-right.
    • Wherever adjacent pair [tokens[i], tokens[i+1]] == [a, b], replace them with the single token a + b (non-overlapping, greedy).
  3. Look up each remaining token in vocab and return the list of ids.

Input:

  • text: raw input string (no whitespace pre-tokenization for this variant)
  • vocab: dict mapping token string → integer id
  • merges: ordered list of merge rules (each a 2-element list [a, b])

Output: list of integer token ids.

Edge cases:

  • If text is empty, return [].
  • If a merge rule does not match anything in the current token sequence, the sequence is unchanged (continue to the next rule).
  • The vocab is guaranteed to contain every possible token that survives after all merges are applied.

Example:

text   = "lower"
merges = [["l","o"], ["e","r"], ["lo","w"], ["low","er"]]
vocab  = {"l":0,"o":1,"w":2,"e":3,"r":4,"lo":5,"er":8,"low":7,"lower":6}

Step 0: ["l", "o", "w", "e", "r"]
Step 1: ["lo", "w", "e", "r"]        # merge ["l","o"]
Step 2: ["lo", "w", "er"]            # merge ["e","r"]
Step 3: ["low", "er"]               # merge ["lo","w"]
Step 4: ["lower"]                   # merge ["low","er"]
ids  = [6]

This problem builds directly on bpe-merge-step — each merge rule in the loop body is exactly one application of that step.

Reference: Sennrich, Haddow & Birch, “Neural Machine Translation of Rare Words with Subword Units”, ACL 2016. Used in production by GPT-2/3, BLOOM, and many other LLMs.

Hints

tokenization bpe llm

Sign in to attempt this problem and view the solution.