Tokenize and Pad Batch

Implement batched tokenization with padding — the preprocessing step that turns a variable-length list of strings into a uniform 2-D token matrix ready for batched model forward passes.

Why padding is needed:

Tensor operations (matrix multiply, attention, convolutions) require every sample in a batch to have the same shape. Real texts have different lengths, so we fix a max_len and either truncate long sequences or right-pad short ones with a sentinel pad_id.

Algorithm:

For each text in the batch:

1. Tokenize character-by-character: ids = [vocab[c] for c in text].
2. Truncate to max_len: ids = ids[:max_len].
3. Right-pad to exactly max_len: ids = ids + [pad_id] * (max_len - len(ids)).

Collect all padded sequences into a list of lists and return it.
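
The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference solution: the function name tokenize_and_pad and the toy vocabulary are assumptions made for the example, while the parameter names and the pad_id default of -1 follow the problem statement.

```python
def tokenize_and_pad(texts, vocab, max_len, pad_id=-1):
    """Character-level tokenize each text, then truncate or right-pad to max_len."""
    batch = []
    for text in texts:
        # Step 1: map each character to its id.
        # A character missing from vocab raises KeyError.
        ids = [vocab[c] for c in text]
        # Step 2: keep at most max_len tokens.
        ids = ids[:max_len]
        # Step 3: fill the remaining positions with pad_id
        # (adds nothing when len(ids) == max_len).
        ids = ids + [pad_id] * (max_len - len(ids))
        batch.append(ids)
    return batch


# Toy vocabulary, assumed purely for illustration.
vocab = {"a": 0, "b": 1, "c": 2}
tokens = tokenize_and_pad(["abc", "a", "cabba"], vocab, max_len=4)
print(tokens)
# [[0, 1, 2, -1], [0, -1, -1, -1], [2, 0, 1, 1]]
```

The first two texts are shorter than max_len and get padded; "cabba" has five characters, so its final character is dropped. Every row ends up with exactly max_len entries.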

Inputs:

texts: list of input strings (tokenized character-by-character).
vocab: dict[str, int] — must contain every character that appears in any text.
max_len: target sequence length. Sequences longer than max_len are truncated; shorter ones are padded.
pad_id: integer sentinel used for padding positions (default -1).

Output: list[list[int]] — shape (len(texts), max_len).

Deriving an attention mask:

Because pad_id is -1 (or any other sentinel that is not a real token id), the attention mask can be recovered from the token matrix itself: attention_mask[i][j] = (tokens[i][j] != pad_id). In production you’d pass this boolean mask to the attention layer so the model does not attend to padding positions.
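
As a small sketch of that rule, the mask can be built with a nested comprehension over an already padded token matrix. The helper name attention_mask and the sample tokens are assumptions for this example, and the result is only unambiguous when pad_id does not collide with any real token id.

```python
def attention_mask(tokens, pad_id=-1):
    """Return True for real tokens and False for padding positions."""
    return [[tok != pad_id for tok in row] for row in tokens]


# Sample padded batch, assumed for illustration (pad_id = -1).
tokens = [[0, 1, 2, -1], [0, -1, -1, -1]]
mask = attention_mask(tokens)
print(mask)
# [[True, True, True, False], [True, False, False, False]]

# Summing each row recovers the unpadded length of each sequence.
lengths = [sum(row) for row in mask]
print(lengths)
# [3, 1]
```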

Note: real production tokenizers (BPE, WordPiece, SentencePiece) operate on subword units rather than individual characters. This character-level version isolates the truncate-and-pad logic so you can focus on the mechanics without the complexity of a merge table.
