OpenAI's GPT-2 Encoder

Now that we understand the basics of the BPE algorithm and how it's used in the GPT tokenizers, let's take a look at the actual implementation released by OpenAI for GPT-2.

In the GPT-2 repository on GitHub, under the src directory, you'll find a file named encoder.py. Despite its name, this file actually contains the full tokenizer implementation, including both encoding and decoding functionality.

Let's walk through this file and see how it relates to the BPE algorithm we implemented earlier.

At the bottom of the file, you'll see this code:

def get_encoder(models_dir):
    # encoder.json: the token-string-to-integer-id vocabulary
    with open(os.path.join(models_dir, 'encoder.json'), 'r') as f:
        encoder = json.load(f)
    # vocab.bpe: the learned merge rules, one pair per line
    with open(os.path.join(models_dir, 'vocab.bpe'), 'r', encoding="utf-8") as f:
        bpe_data = f.read()
    # [1:-1] skips the '#version' header line and the trailing empty line
    bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
    return Encoder(
        encoder=encoder,
        bpe_merges=bpe_merges,
    )

This function loads two files: encoder.json and vocab.bpe. The encoder.json file maps token strings to their integer IDs, playing the same role as our vocab dictionary from earlier. The vocab.bpe file lists the merge operations learned by the BPE algorithm during training, one pair per line in priority order, corresponding to our merges dictionary.
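If you want to poke at these files yourself, here is a minimal sketch, assuming you've already downloaded encoder.json and vocab.bpe into a local directory (the models/124M path below is just an example; the repository's download script places them in a directory like that):

import json
import os

models_dir = 'models/124M'  # example path; adjust to wherever you saved the files

with open(os.path.join(models_dir, 'encoder.json'), 'r') as f:
    encoder = json.load(f)
with open(os.path.join(models_dir, 'vocab.bpe'), 'r', encoding='utf-8') as f:
    bpe_data = f.read()

print(len(encoder))              # 50257: 256 byte tokens + 50000 merges + '<|endoftext|>'
print(list(encoder.items())[:3]) # a few (token string, integer id) pairs
print(bpe_data.split('\n')[0])   # the '#version' header line that [1:-1] skips
print(bpe_data.split('\n')[1])   # the first, highest-priority merge rule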

These files are loaded and passed to the Encoder class constructor. Let's see how that class is implemented:

class Encoder:
    def __init__(self, encoder, bpe_merges):
        self.encoder = encoder
        self.decoder = {v:k for k,v in self.encoder.items()}
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v:k for k, v in self.byte_encoder.items()}
        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
        self.cache = {}
        # note: the file does `import regex as re`; the stdlib re module
        # does not support the \p{L} and \p{N} character classes used here
        self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

In the constructor, we see:

  1. self.encoder is set to the encoder passed in (loaded from encoder.json).
  2. self.decoder is created as the inverse of self.encoder, mapping integer IDs back to tokens.
  3. self.byte_encoder and self.byte_decoder are created. These are used for a preprocessing step that maps every byte value (0-255) to a printable Unicode character, so that arbitrary byte sequences can be handled as strings. We won't go into detail on this, as it's not essential to understanding the BPE algorithm.
  4. self.bpe_ranks is created as a dictionary mapping each merge pair to its rank, i.e., the order in which it was learned during training. During encoding, the candidate pair with the lowest rank is merged first (see the sketch after this list).
  5. self.cache is an empty dictionary that caches the BPE result for each chunk of text, so repeated chunks don't have to be re-merged.
  6. self.pat is the regex pattern we discussed in the previous section, which splits the input text into chunks before BPE is applied (also demonstrated below).
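To make items 4 and 6 concrete, here is a small, self-contained sketch. The toy merge list is hypothetical (not GPT-2's real merges), but the lowest-rank selection mirrors the min(...) call in the Encoder's bpe method, and the pattern is the exact one compiled above:

import regex as re  # the third-party `regex` module; stdlib `re` lacks \p{...} support

# item 6: the pattern pre-splits text into chunks that BPE then merges independently
pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
print(re.findall(pat, "Hello world, it's 2024!"))
# ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']

# item 4: a toy merge list (hypothetical), in the order the merges were learned
bpe_merges = [('t', 'h'), ('th', 'e'), ('l', 'o')]
bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
# {('t', 'h'): 0, ('th', 'e'): 1, ('l', 'o'): 2}

# during encoding, the candidate pair with the LOWEST rank is merged first;
# unknown pairs get rank infinity, so they are never chosen
pairs = {('t', 'h'), ('h', 'e')}
best = min(pairs, key=lambda pair: bpe_ranks.get(pair, float('inf')))
print(best)  # ('t', 'h'): rank 0 beats ('h', 'e'), which has no rank at all

This lowest-rank-first rule is what replays the training-time merges in exactly the same order at encoding time.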
