We can't find the internet
Attempting to reconnect
Something went wrong!
Attempting to reconnect
BPE Encode Text
Implement BPE text encoding — the final encoding step used by GPT-2, GPT-3, BLOOM, and most modern large language models.
What is BPE encoding?
Byte-Pair Encoding (Sennrich et al., ACL 2016) is a subword tokenization algorithm. Training produces two artefacts:
- A vocabulary mapping token strings to integer ids.
- An ordered list of merge rules, each being a pair of token strings.
To encode a new string, you apply the learned merges in priority order to a character-level representation of the text. Each merge rule says “wherever you see these two adjacent tokens, replace them with their concatenation.” After all rules are applied the remaining tokens are looked up in the vocabulary to produce the final token id sequence.
Algorithm:
-
Split
textinto individual characters:tokens = list(text). -
For each merge rule
[a, b]inmerges(in order):-
Walk
tokensleft-to-right. -
Wherever adjacent pair
[tokens[i], tokens[i+1]] == [a, b], replace them with the single tokena + b(non-overlapping, greedy).
-
Walk
-
Look up each remaining token in
vocaband return the list of ids.
Input:
-
text: raw input string (no whitespace pre-tokenization for this variant) -
vocab: dict mapping token string → integer id -
merges: ordered list of merge rules (each a 2-element list[a, b])
Output: list of integer token ids.
Edge cases:
-
If
textis empty, return[]. - If a merge rule does not match anything in the current token sequence, the sequence is unchanged (continue to the next rule).
- The vocab is guaranteed to contain every possible token that survives after all merges are applied.
Example:
text = "lower"
merges = [["l","o"], ["e","r"], ["lo","w"], ["low","er"]]
vocab = {"l":0,"o":1,"w":2,"e":3,"r":4,"lo":5,"er":8,"low":7,"lower":6}
Step 0: ["l", "o", "w", "e", "r"]
Step 1: ["lo", "w", "e", "r"] # merge ["l","o"]
Step 2: ["lo", "w", "er"] # merge ["e","r"]
Step 3: ["low", "er"] # merge ["lo","w"]
Step 4: ["lower"] # merge ["low","er"]
ids = [6]
This problem builds directly on bpe-merge-step — each merge rule in the
loop body is exactly one application of that step.
Reference: Sennrich, Haddow & Birch, “Neural Machine Translation of Rare Words with Subword Units”, ACL 2016. Used in production by GPT-2/3, BLOOM, and many other LLMs.
Hints
Sign in to attempt this problem and view the solution.