We can't find the internet
Attempting to reconnect
Something went wrong!
Attempting to reconnect
Subword Tokenizer: Greedy Longest-Prefix-Match
Implement WordPiece-style greedy longest-prefix-match tokenization — the core scanning algorithm used by BERT and its descendants.
How WordPiece tokenization works
Modern language models like BERT operate on subword tokens rather than whole words or individual characters. The tokenizer maintains a fixed vocabulary of subword strings and decomposes any input string into a sequence of vocab entries that together reconstruct the original text.
WordPiece (Schuster & Nakajima, 2012 — popularised by Devlin et al., BERT,
- uses a greedy longest-prefix-match scan:
pos = 0
while pos < len(text):
find the LONGEST prefix of text[pos:] that is in vocab
emit that prefix, advance pos past it
This differs from BPE, which applies pre-computed merge rules in a fixed priority order. WordPiece greedily commits to the longest match at each position without looking ahead.
Termination guarantee
The vocab is required to contain every individual character of text.
This means the inner scan will always find at least a 1-character match,
so the loop is guaranteed to terminate.
Algorithm detail
Starting at pos = 0, at each step try prefixes from longest to shortest:
text[pos:len(text)], text[pos:len(text)-1], …, text[pos:pos+1].
The first prefix that exists in vocab is emitted; pos advances to the
end of that prefix. Repeat until pos == len(text).
Inputs
-
text: the input string to tokenize. -
vocab: list of valid subword token strings. Must contain every single character that appears intext. Convert to asetfor O(1) lookup.
Output: list[str] — the subword tokens whose concatenation equals
text.
Examples
subword_tokenize("cat", ["c","a","t","ca","cat"]) → ["cat"]
subword_tokenize("cats", ["c","a","t","s","ca","cat"]) → ["cat", "s"]
subword_tokenize("abc", ["a","b","c"]) → ["a","b","c"]
References
- Schuster & Nakajima, “Japanese and Korean Voice Search”, ICASSP 2012.
- Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2019.
Hints
Sign in to attempt this problem and view the solution.