Encoding and decoding functions
With our trained BPE tokenizer (the merges dictionary), we can now write encode and decode functions to go between strings and token lists.
For decoding:
```python
def decode(ids):
    # given ids (list of integers), return Python string
    tokens = b"".join(vocab[idx] for idx in ids)
    text = tokens.decode("utf-8", errors="replace")
    return text
```
We simply look up each token ID in a vocab dictionary (which maps token IDs back to bytes: the 256 raw bytes plus one entry per learned merge), join the bytes together, and decode the result as a UTF-8 string. The errors="replace" flag matters because not every sequence of token IDs decodes to valid UTF-8; with it, invalid bytes become the standard replacement character instead of raising an error.
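For concreteness, here is a minimal sketch of how such a vocab could be built from merges, assuming merges maps (int, int) pairs to new token IDs and iterates in the order the merges were learned (Python dicts preserve insertion order):

```python
# the 256 raw bytes come first, then one entry per learned merge;
# correctness relies on iterating merges in training order, so that
# both halves of a pair already exist in vocab when we combine them
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]
```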
For encoding:
```python
def encode(text):
    # given a string, return list of integers (the tokens)
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:
        stats = get_stats(tokens)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing else can be merged
        idx = merges[pair]
        tokens = merge(tokens, pair, idx)
    return tokens
```
We start by converting the string to raw UTF-8 bytes, giving a list of integers in the range 0 to 255. Then, as long as there are at least 2 tokens, we:
- Get the token pair statistics
- Find the pair that was merged earliest (has the lowest ID in merges)
- If no pair can be merged, stop
- Otherwise, merge that pair
This gives us the final token list. Note that encode reuses the get_stats and merge helpers from the training section; a sketch of both follows below.
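For reference, here is a minimal sketch of what those two helpers might look like (if your versions from the training section differ slightly, use those):

```python
def get_stats(ids):
    # count how often each consecutive pair of token IDs occurs
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every occurrence of pair in ids with the new token idx
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids
```

As a quick sanity check, decoding should invert encoding for any input string:

```python
text = "hello world"
assert decode(encode(text)) == text
```

The reverse doesn't hold in general: not every list of token IDs decodes to valid UTF-8, which is exactly why decode passes errors="replace".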
And there we have it! A full implementation of a BPE tokenizer in about 50 lines of Python. Of course, real-world tokenizers have a lot more going on, but this covers the core idea. In the next section, we'll look at some of the additional complexities introduced in the tokenizers used by GPT and other large language models.