SentencePiece Tokenizer¶
While the GPT tokenizers use a byte-level BPE algorithm, there's another popular tokenizer called SentencePiece that's worth knowing about. SentencePiece is used in many state-of-the-art models, including the Llama and T5 series. (BERT, by contrast, uses a WordPiece tokenizer.)
The main difference between SentencePiece and the GPT tokenizers is that SentencePiece operates directly on Unicode code points, rather than on bytes. This means the base vocabulary is made of characters rather than the 256 possible byte values, which changes some details of the algorithm: rare characters that don't make the vocabulary must either be mapped to an unknown token or, with the byte-fallback option enabled, be encoded as individual bytes.
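To make the distinction concrete, here is a minimal sketch (plain Python, no tokenizer library needed) of how the same string decomposes into base units under the two approaches. The byte-level view is what GPT-style tokenizers start from; the code-point view is what SentencePiece starts from:

```python
text = "héllo"

# Byte-level view (GPT-style): encode to UTF-8 first, then merge bytes.
# The accented character "é" becomes two bytes.
byte_units = list(text.encode("utf-8"))
print(len(byte_units))   # 6 base units

# Code-point view (SentencePiece-style): operate on characters directly.
char_units = list(text)
print(len(char_units))   # 5 base units
```

Notice that any character, however rare, is still at most a handful of bytes in the first view, whereas in the second view each distinct character is its own base symbol, which is why SentencePiece needs an unknown-token or byte-fallback mechanism.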
Let's dive in and see how it works!
Installing SentencePiece¶
First, we need to install the SentencePiece library. You can do this with pip:
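The package is published on PyPI under the name `sentencepiece`:

```shell
pip install sentencepiece
```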
Last update: 2024-08-21