We can't find the internet
Attempting to reconnect
Something went wrong!
Attempting to reconnect
BPE Merge Step
Implement a single merge step of Byte-Pair Encoding (BPE).
BPE was originally a text-compression algorithm (Gage 1994) that was adapted for subword tokenization by Sennrich et al. in “Neural Machine Translation of Rare Words with Subword Units” (ACL 2016). It is now the standard tokenizer used by GPT-2, GPT-4, LLaMA, and most other large language models.
How BPE tokenization works:
- Start with a vocabulary of individual characters (or bytes).
- Count all adjacent symbol pairs in the training corpus.
- Merge the most frequent pair everywhere it appears, creating a new symbol.
- Repeat until the vocabulary reaches its target size.
During encoding, the learned merge rules are applied in order to a fresh character sequence — each rule is a single merge step like this one. After all rules have been applied, the remaining tokens are the final subword tokenization.
Your task: implement one merge step.
Given a sequence of token strings and a bigram pair, return a new sequence
where every consecutive (non-overlapping, left-to-right) occurrence of pair
has been replaced by the concatenation of its two elements.
Input:
-
tokens: list of token strings, e.g.["l", "o", "w", "e", "r"] -
pair: 2-element list giving the bigram to merge, e.g.["l", "o"]
Output: new list of strings after the merge, e.g. ["lo", "w", "e", "r"]
Edge cases:
-
Merges are greedy left-to-right and non-overlapping:
["a", "a", "a"]merged with["a", "a"]→["aa", "a"], not["a", "aa"]. -
If
pairdoes not appear, return a copy oftokensunchanged. -
An empty
tokenslist returns[].
This single step composes into a full BPE encoder (bpe-encode-text): apply
each merge rule from the learned merge table in priority order to turn any
string into its subword token sequence.
Reference: Sennrich, Haddow & Birch, “Neural Machine Translation of Rare Words with Subword Units”, ACL 2016.
Hints
Sign in to attempt this problem and view the solution.