Introduction
Large Language Models, machine translation systems, and other NLP models process text as tokens, which are sequences of characters found in a body of text. During training, the tokenizer learns the statistical relationships between these tokens. Each token can be a whole word or a subword, depending on the number of characters and the tokenization algorithm used. This step is so important in building an LLM because it helps the machine derive meaning from tokens. For example, the word "ringing" is a combination of the tokens "ring" and "ing": if we have never heard "ringing" before but we know what "ring" means, tokenizing the word helps us analyze it and derive meaning from it. Unlike training other models, where stochastic gradient descent is applied to minimize the loss for each batch, training a tokenizer uses statistical operations to build the best lookup table for the tokens. In other words, it is deterministic: you always get the same result if you train on the same corpus with the same algorithm.
Types of Tokenization
The three main types of Tokenization are word-level, subword-level and character-level.
Word-level: This approach breaks text down into words, making it ideal for languages with clear word boundaries such as English. For example, the sentence "I love reading" can be tokenized into ["I", "love", "reading"].
text = "I love reading"
text.split(" ")
Output
["I", "love", "reading"]
Character-level: This approach splits the text into individual characters, including letters, spaces, punctuation, and symbols. It is a very helpful technique at a granular level, such as for spelling correction or for languages that lack clear word boundaries. "I love reading" -> ["I", " ", "l", "o", "v", "e", " ", "r", "e", "a", "d", "i", "n", "g"]
text = "I love reading"
list(text)
Output
['I', ' ', 'l', 'o', 'v', 'e', ' ', 'r', 'e', 'a', 'd', 'i', 'n', 'g']
Subword-level: Subword tokenization strikes a balance between character-level and word-level. This method breaks text down into units that can be whole words or parts of words. For example, the word "ringing" can become ["ring", "ing"], helping the machine understand that "ringing" comes from "ring". This is the most common approach in models like GPT and BERT. The difference between subword-level and the other approaches is that a subword tokenizer needs to be trained and to see an entire corpus, while the others are straightforward. Let's implement one.
We will download some Shakespeare plays from GitHub.
!wget https://github.com/cobanov/shakespeare-dataset/raw/refs/heads/main/text/a-midsummer-nights-dream_TXT_FolgerShakespeare.txt
!wget https://raw.githubusercontent.com/cobanov/shakespeare-dataset/refs/heads/main/text/alls-well-that-ends-well_TXT_FolgerShakespeare.txt
!wget https://github.com/cobanov/shakespeare-dataset/raw/refs/heads/main/text/antony-and-cleopatra_TXT_FolgerShakespeare.txt
!wget https://github.com/cobanov/shakespeare-dataset/raw/refs/heads/main/text/as-you-like-it_TXT_FolgerShakespeare.txt
If you are not running this in a notebook, make sure you remove the '!' from the commands.
After downloading the data, we are going to load it using the open() function.
shakespear_content = [open("a-midsummer-nights-dream_TXT_FolgerShakespeare.txt", "r").read(),
open("alls-well-that-ends-well_TXT_FolgerShakespeare.txt", "r").read(),
open("antony-and-cleopatra_TXT_FolgerShakespeare.txt", "r").read(),
open("antony-and-cleopatra_TXT_FolgerShakespeare.txt", "r").read()]
Check the first 30 characters of the first text:
shakespear_content[0][:30]
Output
"A Midsummer Night's Dream\nby W"
Now we will load a pretrained tokenizer such as BERT's; feel free to load any model you want, such as gpt2.
from transformers import AutoTokenizer
old_tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
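As a point of reference, we can check how large the pretrained vocabulary is; calling len() on a Hugging Face tokenizer returns the number of entries (roughly 29,000 for bert-base-cased):
print(len(old_tokenizer))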
Let's see how it treats an example.
example = shakespear_content[0][:100]
example
Output
"A Midsummer Night's Dream\nby William Shakespeare\nEdited by Barbara A. Mowat and Paul Werstine\n with"
Pass it through the old tokenizer:
tokens = old_tokenizer.tokenize(example)
tokens
Output
['A', 'Mid', '##su', '##mmer', 'Night', "'", 's', 'Dream', 'by', 'William', 'Shakespeare', 'Edit', '##ed', 'by', 'Barbara', 'A', '.', 'Mo', '##wat', 'and', 'Paul', 'We', '##rst', '##ine', 'with']
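Next we train a new tokenizer on our own corpus. Below is a minimal sketch of the training step, assuming the Hugging Face fast-tokenizer API: train_new_from_iterator reuses the old tokenizer's algorithm (WordPiece in BERT's case) and learns a fresh vocabulary from our texts. With a corpus this small, the learned vocabulary may end up well below the requested 52,000 entries.
# Train a new tokenizer with the same algorithm as the old one, on the Shakespeare texts loaded above
tokenizer = old_tokenizer.train_new_from_iterator(shakespear_content, vocab_size=52000)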
We have now trained our new tokenizer on the Shakespeare corpus with a target vocabulary size of 52,000; let's test it.
new_tokens = tokenizer.tokenize(example)
new_tokens
Output
['A', 'Midsummer', 'Night', "'", 's', 'Dream', 'by', 'William', 'Shakespeare', 'Edited', 'by', 'Barbara', 'A', '.', 'Mowat', 'and', 'Paul', 'Werstine', 'with']
As you can see, the tokens from the old tokenizer are not the same as those from the new one. The old tokenizer splits the text according to the general-purpose corpus it was trained on, whereas our new tokenizer knows this vocabulary well and splits the text much more naturally.
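One quick way to see the difference is to compare how many tokens each tokenizer needs for the same text; a vocabulary that fits the domain produces fewer tokens (19 versus 25 in the lists above):
# Fewer tokens for the same text usually means the vocabulary fits the domain better
print(len(tokens), len(new_tokens))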
We get the encoded IDs by running the following:
tokenizer.encode(example)
Output
[2,21368,292,2753,3198,5403,292,5422,31,13,5425,167,3440,5442,206,4004,5429,261,3]
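To check what these numbers correspond to, we can map them back to tokens; convert_ids_to_tokens and decode are standard Hugging Face tokenizer methods, and note that encode also adds the tokenizer's special tokens by default:
ids = tokenizer.encode(example)
print(tokenizer.convert_ids_to_tokens(ids))             # token strings, including the special tokens
print(tokenizer.decode(ids, skip_special_tokens=True))  # back to (roughly) the original text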
These IDs, not the tokens above, are what we feed into the rest of the model; each ID is simply the index of a token in the tokenizer's vocabulary. Now we save the tokenizer for future work so that we don't have to train it again.
tokenizer.save_pretrained("shakspear-tokenizer")
You should get something like
('shakspear-tokenizer/tokenizer_config.json',
'shakspear-tokenizer/special_tokens_map.json',
'shakspear-tokenizer/vocab.txt',
'shakspear-tokenizer/added_tokens.json',
'shakspear-tokenizer/tokenizer.json')
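Later you can reload the saved tokenizer instead of retraining it; from_pretrained accepts a local directory as well as a Hub repository id:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("shakspear-tokenizer")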
You can upload it to the Hugging Face Hub:
from huggingface_hub import notebook_login
notebook_login()
It will ask for your Hugging Face API token; then you can push the tokenizer:
tokenizer.push_to_hub("shakespear-tokenizer")
You can find mine here: https://huggingface.co/otmanheddouch/shakespear-tokenizer.
Tokenization is the first step toward building an LLM. You can either use a pretrained tokenizer such as BERT's for your own use case (which is preferred), or, if your problem is different, a new language, new characters, a new domain, or a new style, train one from scratch. When training a tokenizer on your own data, you have to be very careful when selecting the data and the algorithm: the data, for instance, has to match the language, domain, and style of your problem. Why is this important? Well, imagine going to a mechanic and telling him you are an expert mechanic while not knowing the mechanic's vocabulary, the names of the tools, the different operations, or the car's components. That is exactly what happens when you train a tokenizer on data different from the problem you are trying to solve.
Tokenization Use Cases
Turning sentences into chunks is what powers a huge range of applications, enabling them to process data efficiently.
Machine Translation
Google Translate, for instance, breaks text down into tokens that can be mapped to corresponding tokens in the target language.
Text Classification
In tasks like spam detection, the input is converted into tokens that are then used to classify the text.
Search Engine
Search engines like Google divide the user's query into tokens, which enables the engine to match the query with relevant content.
Chatbot
Chatbots are trained to answer users' questions by being fed myriad tokens. These tokens enable the chatbot to understand and respond to users' queries effectively.
Now we are ready to build an LLM for Shakespeare text generation using the tokenizer we created above. If you want to read more about what goes on behind the scenes in LLMs, check out this article: Transformers.
You can explore the GPT-3 or GPT-4 tokenizer here.