Tokenization

sentence = "I love reading science fiction books or books about science."

 

Definition

Tokenization is the process of breaking down a text into smaller units called tokens.

 

tokenized_sentence = sentence.split(" ")
print(tokenized_sentence)
['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science.']

Counting tokens

from collections import Counter

token_counter = Counter(tokenized_sentence)
print(token_counter.most_common(3))
[('books', 2), ('I', 1), ('love', 1)]

 

# replace the period with a space so "science." and "science" count as the same token
# (note: this leaves a trailing empty-string token, which does not affect the top counts)
tokenized_sentence = sentence.replace(".", " ").split(" ")

token_counter = Counter(tokenized_sentence)
print(token_counter.most_common(2))
[('science', 2), ('books', 2)]

NLTK tokenization

from nltk.tokenize import wordpunct_tokenize
from string import punctuation

# wordpunct_tokenize splits punctuation off into separate tokens
tokenized_sentence = wordpunct_tokenize(sentence)
# keep only tokens that are not single punctuation characters
tokenized_sentence = [t for t in tokenized_sentence if t not in punctuation]
print(tokenized_sentence)
['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science']
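
word_tokenize is another common NLTK tokenizer with similar behavior; a quick sketch, assuming the punkt model data has been downloaded (e.g. via nltk.download("punkt")):

from nltk.tokenize import word_tokenize

print(word_tokenize(sentence))
# expected: ['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science', '.']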

Lemmatization

  • Reduces words to their base or canonical form
  • Represents the dictionary form of a word (the lemma)
  • Standardizes words for better text-analysis accuracy
  • Example: meeting -> meet (verb)
  • Helps in tasks such as text classification, information retrieval, and sentiment analysis
  • Considers context and linguistic rules
  • Retains the semantic meaning of words
  • Typically requires part-of-speech (POS) tagging (see the example below)
  • The POS tag determines the correct lemma based on the word's role in the sentence

flowchart LR
    A(meeting)
    A --> B("meet (verb)")
    A --> C("meeting (noun)")
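
The ambiguity in the diagram can be reproduced directly: WordNetLemmatizer returns different lemmas for the same surface form depending on the POS tag it is given. A minimal sketch, assuming the NLTK WordNet data is available (e.g. via nltk.download("wordnet")):

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("meeting", pos="v"))  # 'meet'    -- treated as a verb
print(wnl.lemmatize("meeting", pos="n"))  # 'meeting' -- treated as a noun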

Lemmatization with WordNet: Nouns

from nltk.stem import WordNetLemmatizer

sentence = "The three brothers went over three big bridges"

wnl = WordNetLemmatizer()

lemmatized_sentence_token = [
    wnl.lemmatize(w, pos="n") for w in sentence.split(" ")
]

print(lemmatized_sentence_token)
['The', 'three', 'brother', 'went', 'over', 'three', 'big', 'bridge']

Lemmatization with WordNet: Verbs

lemmatized_sentence_token = [
    wnl.lemmatize(w, pos="v") for w in sentence.split(" ")
]

print(lemmatized_sentence_token)
['The', 'three', 'brothers', 'go', 'over', 'three', 'big', 'bridge']

Lemmatization with WordNet and POS-tagging

# manually assigned WordNet POS codes: n = noun, v = verb, a = adjective (r = adverb)
pos_dict = {
  "brothers": "n", 
  "went": "v",
  "big": "a",
  "bridges": "n"
}

lemmatized_sentence_token = []
for token in sentence.split(" "):
    if token in pos_dict:
        lemma = wnl.lemmatize(token, pos=pos_dict[token])
    else: 
        lemma = token  # no POS tag known: leave the token as-is

    lemmatized_sentence_token.append(lemma)

print(lemmatized_sentence_token)
['The', 'three', 'brother', 'go', 'over', 'three', 'big', 'bridge']
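
In practice the POS dictionary is not written by hand; a tagger such as nltk.pos_tag can supply the tags automatically. A minimal sketch, assuming the relevant NLTK data is downloaded (averaged_perceptron_tagger, wordnet); the mapping helper penn_to_wordnet is our own illustration, not an NLTK function:

from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # map Penn Treebank tags (NN, VBD, JJ, RB, ...) to WordNet POS codes
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default: noun

wnl = WordNetLemmatizer()
sentence = "The three brothers went over three big bridges"

tagged = pos_tag(sentence.split(" "))  # e.g. [('The', 'DT'), ('brothers', 'NNS'), ...]
lemmatized_sentence_token = [
    wnl.lemmatize(token, pos=penn_to_wordnet(tag)) for token, tag in tagged
]

print(lemmatized_sentence_token)
# expected: ['The', 'three', 'brother', 'go', 'over', 'three', 'big', 'bridge']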

Byte Pair Encoding

Byte Pair Encoding: Why?

  • Tokenization: breaking text into smaller chunks (tokens)
  • Traditional word-level vocabularies: fixed-size, memory-intensive, and unable to represent out-of-vocabulary words
  • Byte pair encoding (BPE): a compression technique adapted to build compact subword vocabularies

Byte Pair Encoding: How?

  • Pair identification: count the frequency of adjacent symbol pairs (initially single characters or bytes)
  • Replacement with a single token: merge the most frequent pair into one new token
  • Iterative process: repeat until a stopping criterion is met (e.g. a target vocabulary size)
  • Vocabulary construction: the vocabulary is the set of initial symbols plus all merged tokens
  • Encoding and decoding: text is encoded and decoded using the constructed vocabulary (a minimal sketch of the merge loop follows)
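
A minimal sketch of the merge loop, in the spirit of the original word-level BPE algorithm (Sennrich et al., 2016); the toy corpus and the number of merges are illustrative only:

import re
from collections import Counter

def get_pair_counts(vocab):
    # count adjacent symbol pairs across all words, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # rewrite the vocabulary, fusing every occurrence of the pair into one symbol
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# toy corpus: each word is a space-separated sequence of characters
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(5):  # stop after 5 merges (illustrative stopping criterion)
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)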

OpenAI Tokenizer
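
The tokenizer behind OpenAI's models can be tried interactively at https://platform.openai.com/tokenizer; programmatically, the tiktoken package exposes the same BPE vocabularies. A short sketch, assuming tiktoken is installed (the encoding name cl100k_base is the one used by GPT-4-era models):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("I love reading science fiction books.")
print(token_ids)              # integer token IDs
print(enc.decode(token_ids))  # round-trips back to the original text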

Byte Pair Encoding: Pros and Cons

Pros:

  • Efficient memory usage
  • Retains information (rare words decompose into known subwords)
  • Flexibility

Cons:

  • Computational overhead
  • Loss of granularity