Encoding and decoding functions

With our trained BPE tokenizer (the merges dictionary), we can now write encode and decode functions to convert between strings and lists of token IDs.

For decoding:

def decode(ids):
  # given ids (list of integers), return Python string
  tokens = b"".join(vocab[idx] for idx in ids)
  text = tokens.decode("utf-8", errors="replace")
  return text

We simply look up each token ID in a vocab dictionary, which maps IDs to byte sequences: the 256 raw bytes plus one entry per learned merge, built from merges. We concatenate the bytes and decode the result as a UTF-8 string. The errors="replace" argument substitutes the Unicode replacement character for any byte sequence that is not valid UTF-8, which can happen if a token list is cut in the middle of a multi-byte character.
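As a concrete illustration, here is a minimal sketch of how vocab can be derived from merges. The single example merge (104, 105) → 256 is hypothetical; a real merges dictionary would contain many entries.

```python
# Hypothetical merges table: bytes 104 ('h') and 105 ('i') merge into token 256.
merges = {(104, 105): 256}

# Start with the 256 raw single-byte tokens...
vocab = {idx: bytes([idx]) for idx in range(256)}

# ...then add the bytes for each merged pair, in the order the merges
# were learned (Python dicts preserve insertion order, so children
# like vocab[p0] and vocab[p1] always exist before they are needed).
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]

print(vocab[256])  # b'hi'
```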

For encoding:

def encode(text):
  # given a string, return list of integers (the tokens)
  tokens = list(text.encode("utf-8"))
  while len(tokens) >= 2:
    stats = get_stats(tokens)
    pair = min(stats, key=lambda p: merges.get(p, float("inf")))
    if pair not in merges:
      break # nothing else can be merged
    idx = merges[pair]
    tokens = merge(tokens, pair, idx)
  return tokens

We start by encoding the string as UTF-8 and listing its raw bytes as integers in 0..255. Then, as long as there are at least 2 tokens, we:

  1. Get the counts of all consecutive token pairs
  2. Find the candidate pair with the lowest merge ID in merges (i.e., the merge that was learned earliest)
  3. If that pair is not in merges at all, no pair can be merged, so stop
  4. Otherwise, merge that pair into its token ID

This gives us the final token list.
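The loop above relies on two helpers, get_stats and merge, from the training code. Here is a minimal self-contained sketch of both, wired up with a toy merges table (the single merge (104, 105) → 256 is hypothetical) so the full round trip can be run end to end:

```python
def get_stats(tokens):
    # Count occurrences of each consecutive token pair.
    counts = {}
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(tokens, pair, idx):
    # Replace every occurrence of `pair` in `tokens` with the new id `idx`.
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Toy merges table: bytes 104 ('h') and 105 ('i') merge into token 256.
merges = {(104, 105): 256}
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]

def encode(text):
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:
        stats = get_stats(tokens)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing else can be merged
        tokens = merge(tokens, pair, merges[pair])
    return tokens

def decode(ids):
    return b"".join(vocab[idx] for idx in ids).decode("utf-8", errors="replace")

print(encode("hi there"))  # [256, 32, 116, 104, 101, 114, 101]
print(decode(encode("hi there")) == "hi there")  # True
```

Note that decode(encode(s)) == s holds for any string, since unmerged bytes pass through untouched and each merged token decodes back to exactly the bytes it was built from.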

And there we have it! A full implementation of a BPE tokenizer in about 50 lines of Python. Of course, real-world tokenizers have a lot more going on, but this covers the core idea. In the next section, we'll look at some of the additional complexities introduced in the tokenizers used by GPT and other large language models.


Last update: 2024-08-21