Brief taste of the complexities of tokenization

Before we dive into details of the implementation, let's briefly motivate the need to understand the tokenization process in some detail. Tokenization is at the heart of a lot of weirdness in LLMs and I would advise that you do not brush it off. A lot of the issues that may look like issues with the neural network architecture actually trace back to tokenization. Here are just a few examples:

Why can't LLMs spell words? Tokenization.
Why can't LLMs do super simple string processing tasks like reversing a string? Tokenization.
Why are LLMs worse at non-English languages (e.g. Japanese)? Tokenization.
Why are LLMs bad at simple arithmetic? Tokenization.
Why did GPT-2 have more trouble than necessary coding in Python? Tokenization.
Why did my LLM abruptly halt when it saw the string "<|endoftext|>"? Tokenization.
What is this weird warning I get about a "trailing whitespace"? Tokenization.
Why did the LLM break when I asked it about "SolidGoldMagikarp"? Tokenization.
Why should I prefer to use YAML over JSON with LLMs? Tokenization.
Why are LLMs not actually doing end-to-end language modeling? Tokenization.
What is the real root of suffering? Tokenization.

We will loop back around to these at the end of the video.
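
To make the non-English item above a little more concrete, here is a minimal sketch using only the Python standard library. It is not a tokenizer. It assumes a byte-level tokenizer that starts from UTF-8 bytes (as GPT-2's does) and simply compares how many bytes two greetings occupy before any merging happens. The helper name `utf8_stats` and the sample strings are my own, chosen for illustration.

```python
# Illustrative only: compare character counts with UTF-8 byte counts.
# Assumption: a byte-level tokenizer starts from these raw bytes.
samples = {
    "English": "hello",
    "Japanese": "こんにちは",
}


def utf8_stats(text: str) -> tuple[int, int]:
    """Return (number of Unicode code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


for name, text in samples.items():
    chars, nbytes = utf8_stats(text)
    print(f"{name}: {chars} characters, {nbytes} UTF-8 bytes")
```

Both strings are five characters long, but the Japanese one takes three times as many bytes. How many tokens each ends up as depends on the merges a particular tokenizer has learned, so treat this as one plausible contributing factor, not the whole story.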

Last update: 2024-08-21