OpenAI's GPT-2 Encoder
Now that we understand the basics of the BPE algorithm and how it's used in the GPT tokenizers, let's take a look at the actual implementation released by OpenAI for GPT-2.
In the GPT-2 repository on GitHub, under the src directory, you'll find a file named encoder.py. Despite its name, this file actually contains the full tokenizer implementation, including both encoding and decoding functionality.
Let's walk through this file and see how it relates to the BPE algorithm we implemented earlier.
At the bottom of the file, you'll see this code:
```python
def get_encoder(models_dir):
    with open(os.path.join(models_dir, 'encoder.json'), 'r') as f:
        encoder = json.load(f)
    with open(os.path.join(models_dir, 'vocab.bpe'), 'r', encoding="utf-8") as f:
        bpe_data = f.read()
    bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
    return Encoder(
        encoder=encoder,
        bpe_merges=bpe_merges,
    )
```
This function loads two files: encoder.json and vocab.bpe. The encoder.json file contains a mapping from tokens to their integer IDs, similar to our vocab dictionary from earlier. The vocab.bpe file contains the actual merge operations learned by the BPE algorithm, similar to our merges dictionary.
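To make the two file formats concrete, here is a minimal sketch using toy in-memory stand-ins for the files (the tokens and merges are hypothetical, not GPT-2's actual vocabulary; `vocab.bpe` is assumed to start with a version header line, which the slice `[1:-1]` skips along with the trailing empty line):

```python
import json

# Hypothetical miniature stand-in for encoder.json:
# a JSON object mapping token strings to integer IDs.
encoder_json = '{"h": 0, "e": 1, "l": 2, "o": 3, "he": 4, "ll": 5, "hell": 6}'

# Hypothetical miniature stand-in for vocab.bpe:
# a version header, then one merge pair per line, in learned order.
vocab_bpe = "#version: 0.2\nh e\nl l\nhe ll\n"

encoder = json.loads(encoder_json)

# Same parsing as get_encoder: drop the header line and the trailing
# empty line, then split each remaining line into a (left, right) pair.
bpe_merges = [tuple(merge_str.split()) for merge_str in vocab_bpe.split('\n')[1:-1]]

print(encoder["hell"])  # 6
print(bpe_merges)       # [('h', 'e'), ('l', 'l'), ('he', 'll')]
```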
These files are loaded and passed to the Encoder class constructor. Let's see how that class is implemented:
```python
class Encoder:
    def __init__(self, encoder, bpe_merges):
        self.encoder = encoder
        self.decoder = {v: k for k, v in self.encoder.items()}
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
        self.cache = {}
        self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
```
In the constructor, we see:
- `self.encoder` is set to the `encoder` passed in (loaded from `encoder.json`).
- `self.decoder` is created as the inverse of `self.encoder`, mapping integer IDs back to tokens.
- `self.byte_encoder` and `self.byte_decoder` are created. These are used for a preprocessing step that maps every byte to a printable Unicode character. We won't go into detail on this, as it's not essential to understanding the BPE algorithm.
- `self.bpe_ranks` is created as a dictionary mapping each merge pair to its rank (the order in which it was learned). This is used to determine which pairs to merge first during encoding: lower rank means merged earlier.
- `self.cache` is an empty dictionary used to cache the results of encoding.
- `self.pat` is the regex pattern we discussed in the previous section. (Note that the `\p{L}` and `\p{N}` character classes are not supported by Python's standard `re` module; encoder.py imports the third-party `regex` package under the name `re`.)
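The `bpe_ranks` and `decoder` constructions are one-liners worth seeing in isolation. A small sketch with a hypothetical merge list and vocabulary (not GPT-2's real data):

```python
# A hypothetical merge list, in the order the merges were learned.
bpe_merges = [('h', 'e'), ('l', 'l'), ('he', 'll')]

# Each pair maps to its position in the list: lower rank = learned earlier.
bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
print(bpe_ranks)  # {('h', 'e'): 0, ('l', 'l'): 1, ('he', 'll'): 2}

# During encoding, the candidate pair with the lowest rank is merged first;
# pairs that were never learned get infinite rank and are never chosen.
candidates = [('l', 'l'), ('he', 'll')]
best = min(candidates, key=lambda pair: bpe_ranks.get(pair, float('inf')))
print(best)  # ('l', 'l')

# The decoder is simply the inverted encoder mapping.
encoder = {"he": 0, "ll": 1, "hell": 2}
decoder = {v: k for k, v in encoder.items()}
print(decoder[2])  # hell
```

The rank lookup with a `float('inf')` default is the same trick the real `bpe` method uses to pick which pair to merge next.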