Subword Tokenizer: Greedy Longest-Prefix-Match

Implement WordPiece-style greedy longest-prefix-match tokenization — the core scanning algorithm used by BERT and its descendants.

How WordPiece tokenization works

Modern language models like BERT operate on subword tokens rather than whole words or individual characters. The tokenizer maintains a fixed vocabulary of subword strings and decomposes any input string into a sequence of vocab entries that together reconstruct the original text.

WordPiece (Schuster & Nakajima, 2012 — popularised by Devlin et al., BERT,

uses a greedy longest-prefix-match scan:

pos = 0
while pos < len(text):
    find the LONGEST prefix of text[pos:] that is in vocab
    emit that prefix, advance pos past it

This differs from BPE, which applies pre-computed merge rules in a fixed priority order. WordPiece greedily commits to the longest match at each position without looking ahead.

Termination guarantee

The vocab is required to contain every individual character of text. This means the inner scan will always find at least a 1-character match, so the loop is guaranteed to terminate.

Algorithm detail

Starting at pos = 0, at each step try prefixes from longest to shortest: text[pos:len(text)], text[pos:len(text)-1], …, text[pos:pos+1]. The first prefix that exists in vocab is emitted; pos advances to the end of that prefix. Repeat until pos == len(text).

Inputs

text: the input string to tokenize.
vocab: list of valid subword token strings. Must contain every single character that appears in text. Convert to a set for O(1) lookup.

Output: list[str] — the subword tokens whose concatenation equals text.

Examples

subword_tokenize("cat", ["c","a","t","ca","cat"])  → ["cat"]
subword_tokenize("cats", ["c","a","t","s","ca","cat"]) → ["cat", "s"]
subword_tokenize("abc", ["a","b","c"])              → ["a","b","c"]

References

Schuster & Nakajima, “Japanese and Korean Voice Search”, ICASSP 2012.
Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2019.

Subword Tokenizer: Greedy Longest-Prefix-Match

Hints