= "I love reading science fiction books or books about science." sentence
Tokenization
Simple word tokenization
A key element for a computer to understand the words we speak or type is the concept of word tokenization. For a human, the sentence
is easy to understand since we are able to split the sentence into its individual parts in order to figure out the meaning of the full sentence. For a computer, the sentence is just a simple string of characters, like any other word or longer text. In order to make a computer understand the meaning of a sentence, we need to help break it down into its relevant parts.
Simply put, word tokenization is the process of breaking down a piece of text into individual words or so-called tokens. It is like taking a sentence and splitting it into smaller pieces, where each piece represents a word. Word tokenization involves analyzing the text character by character and identifying boundaries between words. It uses various rules and techniques to decide where one word ends and the next one begins. For example, spaces, punctuation marks, and special characters often serve as natural boundaries between words.
So let’s start breaking down the sentence into its individual parts.
tokenized_sentence = sentence.split(" ")
print(tokenized_sentence)
['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science.']
Once we have tokenized the sentence, we can start analyzing it with some simple statistical methods. For example, in order to figure out what the sentence might be about, we could count the most frequent words.
from collections import Counter
token_counter = Counter(tokenized_sentence)
print(token_counter.most_common(2))
[('books', 2), ('I', 1)]
Unfortunately, we already realize that we have not done the best job with our “tokenizer”: the second occurrence of the word science is not counted because it is still attached to the final period. While this punctuation is great, as it holds information about the ending of a sentence, it disturbs our analysis here, so let’s get rid of it.
tokenized_sentence = sentence.replace(".", " ").split()  # split() without an argument also drops empty strings
token_counter = Counter(tokenized_sentence)
print(token_counter.most_common(2))
[('science', 2), ('books', 2)]
So that worked. As you can imagine, tokenization can get increasingly difficult when we have to deal with all sorts of situations in larger corpora of text (see also the exercise). Luckily, there are already all sorts of libraries available that can help us with this process.
from nltk.tokenize import wordpunct_tokenize
from string import punctuation
tokenized_sentence = wordpunct_tokenize(sentence)
tokenized_sentence = [t for t in tokenized_sentence if t not in punctuation]
print(tokenized_sentence)
['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science']
Advanced word tokenization
As you can imagine, the idea of tokenization does not stop here; we have only touched on the basics. As a general strategy, we can now start from our word tokens and refine them in whatever way we need. A classic problem with the tokens above is that we do not get word tokens in their standardized form. If, for example, we were to take all our tokens together and use them as a dictionary, we would get two different tokens for every word that appears in both singular and plural form (with the added “s”). Likewise, we would get different tokens for every verb in its different conjugations (for example, “speak”, “speaks”, and “spoken”). Depending on our task, a good idea is to try to find the basic form of each token, a process called lemmatization.
Lemmatization
Lemmatization is a natural language processing technique used to reduce words to their base or canonical form, known as the lemma. The lemma represents the dictionary form of a word and is typically a valid word that exists in the language. Lemmatization helps in standardizing words so that different forms of the same word are treated as one, simplifying text analysis and improving accuracy in tasks such as text classification, information retrieval, and sentiment analysis.
Unlike stemming, which simply chops off prefixes or suffixes to derive the root form of a word (sometimes resulting in non-existent or incorrect forms), lemmatization considers the context of the word and applies linguistic rules to transform it into its lemma. This ensures that the resulting lemma is a valid word in the language and retains its semantic meaning.
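To make the difference concrete, here is a minimal sketch comparing a stemmer with a lemmatizer, assuming nltk and its WordNet data are available (the example words are just illustrations):

from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("studies"))              # 'studi' -- a chopped-off stem, not a real word
print(lemmatizer.lemmatize("studies"))      # 'study' -- a valid dictionary form
print(stemmer.stem("better"))               # 'better' -- the stemmer cannot relate it to 'good'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good' -- the lemma of the adjective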
Lemmatization involves part-of-speech tagging to determine the correct lemma for each word based on its role in the sentence. For example, the word “running” may be lemmatized to “run” as a verb, but kept as “running” as a noun. Note, however, that once we have lemmatized our text, we might lose some information, since the original word forms and their context are discarded.
Luckily for us, there are already pre-built packages that we can use to try out lemmatization. Here is a quick example of how to do it with nltk.
from nltk.stem import WordNetLemmatizer
= "The three brothers went over three big bridges"
sentence
= WordNetLemmatizer()
wnl
= [
lemmatized_sentence_token ="n") for w in sentence.split(" ")
wnl.lemmatize(w, pos
]
print(lemmatized_sentence_token)
['The', 'three', 'brother', 'went', 'over', 'three', 'big', 'bridge']
Since we need to include the pos (part-of-speech) tag of each word and we only chose the noun tag (n) here, the lemmatizer only takes care of the nouns in the sentence. Let’s try it with verbs.
lemmatized_sentence_token = [
    wnl.lemmatize(w, pos="v") for w in sentence.split(" ")
]
print(lemmatized_sentence_token)
['The', 'three', 'brothers', 'go', 'over', 'three', 'big', 'bridge']
This works; however, we now encounter a different problem: the word bridges has been turned into bridge only because it also exists as the verb “to bridge”, while brothers is now left untouched. So, as mentioned above, we need to involve some part-of-speech tagging in order to do it correctly. Let’s try to fix it manually.
pos_dict = {
    "brothers": "n",
    "went": "v",
    "big": "a",
    "bridges": "n"
}
lemmatized_sentence_token = []
for token in sentence.split(" "):
    if token in pos_dict:
        lemma = wnl.lemmatize(token, pos=pos_dict[token])
    else:
        lemma = token  # leave as it is
    lemmatized_sentence_token.append(lemma)
print(lemmatized_sentence_token)
['The', 'three', 'brother', 'go', 'over', 'three', 'big', 'bridge']
Again, luckily there are also some packages that we can use; spaCy is one example. Its models come with a lot of built-in functionality.
import spacy
nlp = spacy.load("en_core_web_sm")  # the small English model, installed via: python -m spacy download en_core_web_sm
doc = nlp(sentence)
lemmatized_words = [(token.lemma_, token.pos_) for token in doc]
print(lemmatized_words)
[('the', 'DET'), ('three', 'NUM'), ('brother', 'NOUN'), ('go', 'VERB'), ('over', 'ADP'), ('three', 'NUM'), ('big', 'ADJ'), ('bridge', 'NOUN')]
Now that we have successfully lemmatized our document or text, we can start using the lemmas to do some further analysis. For example, we could build a dictionary and do some statistics on it. We will see more about that later.
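As a small preview, and reusing the spaCy pipeline loaded above, counting lemma frequencies could look roughly like this (filtering out punctuation and stopwords is our own choice here):

from collections import Counter
doc = nlp("I love reading science fiction books or books about science.")
lemma_counts = Counter(
    token.lemma_ for token in doc if not token.is_punct and not token.is_stop
)
print(lemma_counts.most_common(3))

Because inflected forms are collapsed onto their lemmas, book and science should each appear twice, matching our earlier intuition about the sentence.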
Byte Pair Encoding
The ideas above illustrate well the core idea of tokenization: splitting text into smaller chunks that we can feed to a language model. In practice, especially in models like GPT, a critical component is the vocabulary, i.e. the set of unique words or tokens the model understands. Traditional approaches use fixed-size word vocabularies, which means every unique word in the corpus has its own representation (index or embedding) in the model’s vocabulary. However, as the vocabulary size increases (for example, by including more languages), so does the memory requirement, which can be impractical for large-scale language models.
One solution is so-called byte pair encoding (BPE). Byte pair encoding is originally a data compression technique that has been adapted to tackle the issue of large vocabularies in language models. Instead of assigning a unique index or embedding to every word, byte pair encoding identifies frequent pairs of symbols (characters or bytes) within the corpus and merges them into a single token. This effectively keeps the vocabulary at a manageable size while preserving the essential information needed for language modeling tasks.
How it works
Tokenization: The first step in byte pair encoding is tokenization, where the text corpus is broken down into individual tokens. These tokens could be characters, subwords, or words, depending on the tokenization strategy used.
Pair Identification: Next, the algorithm identifies pairs of symbols (characters or bytes) that occur frequently within the corpus. These pairs are adjacent symbols in the text.
Replacement with a Single Token: Once the most frequent pair is identified, all of its occurrences are merged into a single new token. Compared to giving every full word its own entry, this keeps the vocabulary small while shortening the encoded text.
Iterative Process: The process of identifying frequent pairs and replacing them with single tokens is iterative. It continues until a predefined stopping criterion is met, such as reaching a target vocabulary size or when no more frequent pairs can be found.
Vocabulary Construction: After the iterative process, a vocabulary is constructed, consisting of the single tokens generated through pair replacement, along with any remaining tokens from the original tokenization process.
Encoding and Decoding: During training and inference, text data is encoded using the constructed vocabulary, where each token is represented by its corresponding index in the vocabulary. During decoding, the indices are mapped back to their respective tokens.
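To make these steps concrete, here is a minimal sketch of the classic character-level merge loop. The toy corpus, the end-of-word marker </w>, and the number of merge steps are illustrative choices rather than any particular library’s implementation:

import re
from collections import Counter

def count_pairs(vocab):
    # count how often each pair of adjacent symbols occurs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every occurrence of the chosen symbol pair with one merged symbol
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# toy corpus: each word is written as space-separated characters plus an end-of-word marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(5):  # in practice, the loop runs until a target vocabulary size is reached
    pair_counts = count_pairs(vocab)
    best = max(pair_counts, key=pair_counts.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)
print(merges)

On this toy corpus the first merges are pairs such as ('e', 's') and ('es', 't'), so the frequent ending est quickly becomes a single token; encoding new text then simply replays the learned merges in order.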
Advantages of Byte Pair Encoding
Efficient Memory Usage: Byte pair encoding significantly reduces the size of the vocabulary compared to a word-level one, leading to more efficient memory usage, especially in large-scale language models.
Retains Information: Despite reducing the vocabulary size, byte pair encoding retains important linguistic information by capturing frequent symbol pairs.
Flexible: Byte pair encoding is flexible and can be adapted to different tokenization strategies and corpus characteristics.
Limitations and Considerations
Computational Overhead: The iterative nature of byte pair encoding can be computationally intensive, especially for large corpora.
Loss of Granularity: While byte pair encoding reduces vocabulary size, it may lead to a loss of granularity, especially for rare or out-of-vocabulary words, which get split into many small pieces.
Tokenization Strategy: The effectiveness of byte pair encoding depends on the tokenization strategy used and the characteristics of the corpus.
From the OpenAI Guide:
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
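To get a feeling for this rule of thumb, and assuming the tiktoken package (OpenAI’s open-source tokenizer library) is installed, we can count the tokens of our example sentence ourselves:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # one of OpenAI's published BPE vocabularies
text = "I love reading science fiction books or books about science."
tokens = encoding.encode(text)
print(len(text), "characters,", len(tokens), "tokens")
print(encoding.decode(tokens) == text)  # decoding the token ids restores the original text

For ordinary English text like this, the ratio of characters to tokens usually lands somewhere in the ballpark of the four-characters-per-token rule quoted above.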