Tokenization

sentence = "I love reading science fiction books or books about science."

 

Definition

Tokenization is the process of breaking down a text into smaller units called tokens.

 

tokenized_sentence = sentence.split(" ")
print(tokenized_sentence)
['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science.']

Counting tokens

from collections import Counter

token_counter = Counter(tokenized_sentence)
print(token_counter.most_common(3))
[('books', 2), ('I', 1), ('love', 1)]

 

# replace the period with a space so "science." and "science" count as the same token
# (note: this leaves a trailing empty-string token, which does not affect the top counts)
tokenized_sentence = sentence.replace(".", " ").split(" ")

token_counter = Counter(tokenized_sentence)
print(token_counter.most_common(2))
[('science', 2), ('books', 2)]

NLTK tokenization

from nltk.tokenize import wordpunct_tokenize
from string import punctuation

# wordpunct_tokenize splits punctuation off into separate tokens
tokenized_sentence = wordpunct_tokenize(sentence)
# keep only tokens that are not single punctuation characters
tokenized_sentence = [t for t in tokenized_sentence if t not in punctuation]
print(tokenized_sentence)
['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science']
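
word_tokenize is another common NLTK tokenizer with similar behavior; a quick sketch, assuming the punkt model data has been downloaded (e.g. via nltk.download("punkt")):

from nltk.tokenize import word_tokenize

print(word_tokenize(sentence))
# expected: ['I', 'love', 'reading', 'science', 'fiction', 'books', 'or', 'books', 'about', 'science', '.']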

Lemmatization

  • Reduces words to their base or canonical form
  • Represents the dictionary form of a word (the lemma)
  • Standardizes words for better text-analysis accuracy
  • Example: meeting -> meet (verb)
  • Helps in tasks such as text classification, information retrieval, and sentiment analysis
  • Considers context and linguistic rules
  • Retains the semantic meaning of words
  • Typically requires part-of-speech (POS) tagging (see the example below)
  • The POS tag determines the correct lemma based on the word's role in the sentence

flowchart LR
    A(meeting)
    A --> B("meet (verb)")
    A --> C("meeting (noun)")
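
The ambiguity in the diagram can be reproduced directly: WordNetLemmatizer returns different lemmas for the same surface form depending on the POS tag it is given. A minimal sketch, assuming the NLTK WordNet data is available (e.g. via nltk.download("wordnet")):

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("meeting", pos="v"))  # 'meet'    -- treated as a verb
print(wnl.lemmatize("meeting", pos="n"))  # 'meeting' -- treated as a noun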

Lemmatization with WordNet: Nouns

from nltk.stem import WordNetLemmatizer

sentence = "The three brothers went over three big bridges"

wnl = WordNetLemmatizer()

lemmatized_sentence_token = [
    wnl.lemmatize(w, pos="n") for w in sentence.split(" ")
]

print(lemmatized_sentence_token)
['The', 'three', 'brother', 'went', 'over', 'three', 'big', 'bridge']

Lemmatization with WordNet: Verbs

lemmatized_sentence_token = [
    wnl.lemmatize(w, pos="v") for w in sentence.split(" ")
]

print(lemmatized_sentence_token)
['The', 'three', 'brothers', 'go', 'over', 'three', 'big', 'bridge']

Lemmatization with WordNet and POS-tagging

# manually assigned WordNet POS codes: n = noun, v = verb, a = adjective (r = adverb)
pos_dict = {
  "brothers": "n", 
  "went": "v",
  "big": "a",
  "bridges": "n"
}

lemmatized_sentence_token = []
for token in sentence.split(" "):
    if token in pos_dict:
        lemma = wnl.lemmatize(token, pos=pos_dict[token])
    else: 
        lemma = token  # no POS tag known: leave the token as-is

    lemmatized_sentence_token.append(lemma)

print(lemmatized_sentence_token)
['The', 'three', 'brother', 'go', 'over', 'three', 'big', 'bridge']
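
In practice the POS dictionary is not written by hand; a tagger such as nltk.pos_tag can supply the tags automatically. A minimal sketch, assuming the relevant NLTK data is downloaded (averaged_perceptron_tagger, wordnet); the mapping helper penn_to_wordnet is our own illustration, not an NLTK function:

from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # map Penn Treebank tags (NN, VBD, JJ, RB, ...) to WordNet POS codes
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default: noun

wnl = WordNetLemmatizer()
sentence = "The three brothers went over three big bridges"

tagged = pos_tag(sentence.split(" "))  # e.g. [('The', 'DT'), ('brothers', 'NNS'), ...]
lemmatized_sentence_token = [
    wnl.lemmatize(token, pos=penn_to_wordnet(tag)) for token, tag in tagged
]

print(lemmatized_sentence_token)
# expected: ['The', 'three', 'brother', 'go', 'over', 'three', 'big', 'bridge']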

Byte Pair Encoding

Byte Pair Encoding: Why?

  • Tokenization: breaking text into smaller chunks (tokens)
  • Traditional word-level vocabularies: fixed-size, memory-intensive, and unable to represent out-of-vocabulary words
  • Byte pair encoding (BPE): a compression technique adapted to build compact subword vocabularies

Byte Pair Encoding: How?

  • Pair identification: count the frequency of adjacent symbol pairs (initially single characters or bytes)
  • Replacement with a single token: merge the most frequent pair into one new token
  • Iterative process: repeat until a stopping criterion is met (e.g. a target vocabulary size)
  • Vocabulary construction: the vocabulary is the set of initial symbols plus all merged tokens
  • Encoding and decoding: text is encoded and decoded using the constructed vocabulary (a minimal sketch of the merge loop follows)
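
A minimal sketch of the merge loop, in the spirit of the original word-level BPE algorithm (Sennrich et al., 2016); the toy corpus and the number of merges are illustrative only:

import re
from collections import Counter

def get_pair_counts(vocab):
    # count adjacent symbol pairs across all words, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # rewrite the vocabulary, fusing every occurrence of the pair into one symbol
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# toy corpus: each word is a space-separated sequence of characters
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(5):  # stop after 5 merges (illustrative stopping criterion)
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)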

OpenAI Tokenizer
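
The tokenizer behind OpenAI's models can be tried interactively at https://platform.openai.com/tokenizer; programmatically, the tiktoken package exposes the same BPE vocabularies. A short sketch, assuming tiktoken is installed (the encoding name cl100k_base is the one used by GPT-4-era models):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("I love reading science fiction books.")
print(token_ids)              # integer token IDs
print(enc.decode(token_ids))  # round-trips back to the original text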

Byte Pair Encoding: Pros and Cons

Pros:

  • Efficient memory usage
  • Retains information (rare words decompose into known subwords)
  • Flexibility

Cons:

  • Computational overhead
  • Loss of granularity