Tokenize and Pad Batch
Implement batched tokenization with padding — the preprocessing step that turns a variable-length list of strings into a uniform 2-D token matrix ready for batched model forward passes.
Why padding is needed:
Tensor operations (matrix multiply, attention, convolutions) require every
sample in a batch to have the same shape. Real texts have different lengths,
so we fix a max_len and either truncate long sequences or right-pad short
ones with a sentinel pad_id.
Algorithm:
For each text in the batch:
- Tokenize character-by-character: ids = [vocab[c] for c in text].
- Truncate to max_len: ids = ids[:max_len].
- Right-pad to exactly max_len: ids = ids + [pad_id] * (max_len - len(ids)).
Collect all padded sequences into a list of lists and return it.
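The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference solution: the function name tokenize_and_pad and the toy vocabulary are assumptions made for the example, and a character missing from vocab raises a KeyError, consistent with the requirement that vocab cover every character.

```python
def tokenize_and_pad(texts, vocab, max_len, pad_id=-1):
    """Hypothetical helper: character-level tokenize, truncate, right-pad."""
    batch = []
    for text in texts:
        # Step 1: map each character to its id (KeyError if a char is missing).
        ids = [vocab[c] for c in text]
        # Step 2: drop everything past max_len.
        ids = ids[:max_len]
        # Step 3: append pad_id until the row is exactly max_len long.
        # For rows already at max_len this multiplies by 0 and adds nothing.
        ids = ids + [pad_id] * (max_len - len(ids))
        batch.append(ids)
    return batch


# Toy vocabulary, assumed for illustration only.
vocab = {"a": 0, "b": 1, "c": 2}
print(tokenize_and_pad(["abc", "a", "cabba"], vocab, max_len=4))
# [[0, 1, 2, -1], [0, -1, -1, -1], [2, 0, 1, 1]]
```

Truncating before padding keeps the pad count non-negative, so the same two lines handle short, exact-length, and over-long inputs without branching.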
Inputs:
- texts: list of input strings (tokenized character-by-character).
- vocab: dict[str, int]; must contain every character that appears in any text.
- max_len: target sequence length. Sequences longer than max_len are truncated; shorter ones are padded.
- pad_id: integer sentinel used for padding positions (default -1).
Output: list[list[int]] — shape (len(texts), max_len).
Deriving an attention mask:
Using -1 (or any sentinel not in the real vocabulary) makes the mask
implicit: attention_mask[i][j] = (tokens[i][j] != pad_id). In production
you’d pass this boolean mask to the attention layer to prevent the model
from attending to padding positions.
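As a rough sketch of that derivation, the mask can be computed directly from the padded token matrix. The function name build_attention_mask and the sample tokens are assumptions for illustration; the sketch also assumes pad_id never appears as a real token id, since otherwise real tokens would be masked out too.

```python
def build_attention_mask(tokens, pad_id=-1):
    """Hypothetical helper: True for real tokens, False for padding."""
    # Compare every position against pad_id, preserving the 2-D shape.
    return [[tok != pad_id for tok in row] for row in tokens]


# Sample padded batch, assumed for illustration (max_len = 4, pad_id = -1).
tokens = [[0, 1, 2, -1], [0, -1, -1, -1]]
print(build_attention_mask(tokens))
# [[True, True, True, False], [True, False, False, False]]
```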
Note: real production tokenizers (BPE, WordPiece, SentencePiece) operate on subword units rather than individual characters. This character-level version isolates the truncate-and-pad logic so you can focus on the mechanics without the complexity of a merge table.