from nltk.tokenize import wordpunct_tokenize
from string import punctuation
from typing import List
from nltk.corpus import stopwords
# python -m nltk.downloader stopwords -> run this in your console once to get the stopwords
def preprocess_text(text: str) -> List[str]:
    """Lowercase, tokenize, and clean a raw text string.

    Returns the tokens of ``text`` with punctuation tokens and English
    stopwords removed.
    """
    # Tokenize on alphabetic/non-alphabetic boundaries.
    tokens = wordpunct_tokenize(text.lower())
    # Remove punctuation. NOTE(review): `punctuation` is a string, so only
    # single-character tokens match; multi-char tokens like "!!" survive.
    tokens = [t for t in tokens if t not in punctuation]
    # Remove stopwords; a set gives O(1) membership tests instead of
    # scanning the stopword list once per token.
    stop_words = set(stopwords.words("english"))
    return [t for t in tokens if t not in stop_words]
Exercise: TF-IDF
Task: Extend the code for the bag of words to TF-IDF (Term Frequency-Inverse Document Frequency) vectors for a given set of documents. TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. This measure helps in identifying words that are unique and informative to a particular document while downweighting common words that appear across many documents.
TF-IDF consists of two main components:
Term Frequency (TF): This component measures how frequently a term occurs in a document. It is calculated as the ratio of the count of a term in a document to the total number of terms in the document. TF is higher for words that occur more frequently within a document.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
Inverse Document Frequency (IDF): This component measures the rarity of a term across the entire corpus of documents. It is calculated as the logarithm of the ratio of one plus the total number of documents to one plus the number of documents containing the term (the added ones smooth the ratio and avoid division by zero). IDF is higher for terms that appear in only a few documents and lower for terms that are common across the corpus.
IDF(t) = log((1 + Total number of documents) / (1 + Number of documents containing term t))
The TF-IDF score for a term in a document is obtained by multiplying its TF and IDF scores. This score reflects the importance of the term in the context of the document and the entire corpus.
Instructions:
- Implement functions `calculate_tf` and `calculate_idf` to calculate Term Frequency (TF) and Inverse Document Frequency (IDF) respectively.
- Write a `create_tf_idf` function to create TF-IDF vectors for a given set of documents. This function should count the frequency of each word in the corpus, calculate TF and IDF, and compute TF-IDF vectors for each document.
Show solution
from collections import Counter
import math
def calculate_tf(word_counts, total_words):
    """Calculate Term Frequency (TF) for one document.

    Args:
        word_counts: mapping of word -> number of occurrences in the document.
        total_words: total number of tokens in the document (must be > 0).

    Returns:
        dict mapping each word to its frequency ratio,
        TF(t) = count(t) / total_words.
    """
    return {word: count / total_words for word, count in word_counts.items()}
def calculate_idf(word_counts, num_documents):
    """Calculate smoothed Inverse Document Frequency (IDF).

    Args:
        word_counts: mapping of word -> number of documents containing it.
        num_documents: total number of documents in the corpus.

    Returns:
        dict mapping each word to IDF(t) = log((1 + N) / (1 + df(t))).
        The +1 smoothing keeps the ratio finite and non-negative even for
        terms present in every document.
    """
    return {
        word: math.log((1 + num_documents) / (1 + count))
        for word, count in word_counts.items()
    }
def create_tf_idf(texts):
    """Build TF-IDF vectors for a collection of raw text documents.

    Args:
        texts: list of raw document strings.

    Returns:
        (vocabulary, tf_idf_vectors): the sorted corpus vocabulary, and one
        list per document of TF-IDF scores (rounded to 2 decimals) aligned
        with that vocabulary order.
    """
    # Preprocess every document exactly once and reuse the token lists,
    # instead of re-tokenizing in each pass.
    tokenized_docs = [preprocess_text(text) for text in texts]

    # Corpus statistics: total term counts (for the vocabulary) and
    # document frequencies (for IDF).
    word_counts = Counter()
    doc_freqs = Counter()  # number of documents containing each word
    for words in tokenized_docs:
        word_counts.update(words)
        # set(words): count each word at most once per document, as the
        # IDF definition requires (documents containing the term), not
        # total occurrences across the corpus.
        doc_freqs.update(set(words))

    # Create sorted vocabulary
    vocabulary = sorted(word_counts.keys())

    # IDF depends only on the corpus, so compute it once rather than
    # inside the per-document loop.
    num_documents = len(texts)
    idf = calculate_idf(doc_freqs, num_documents)

    # Calculate TF-IDF for each document
    tf_idf_vectors = []
    for words in tokenized_docs:
        tf = calculate_tf(Counter(words), len(words))
        # One TF-IDF score per vocabulary word, in vocabulary order;
        # words absent from this document contribute 0.
        vector = [round(tf.get(word, 0) * idf[word], 2) for word in vocabulary]
        tf_idf_vectors.append(vector)
    return vocabulary, tf_idf_vectors
# Example texts
texts = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Create TF-IDF vectors
vocabulary, tf_idf_vectors = create_tf_idf(texts)

# Print vocabulary
print("Vocabulary:")
print(vocabulary)

# Print TF-IDF vectors, one row per document, aligned with the vocabulary.
print("\nTF-IDF Vectors:")
for i, tf_idf_vector in enumerate(tf_idf_vectors):
    print(f"Document {i + 1}: {tf_idf_vector}")
Vocabulary:
['document', 'first', 'one', 'second', 'third']
TF-IDF Vectors:
Document 1: [0.0, 0.26, 0.0, 0.0, 0.0]
Document 2: [0.0, 0.0, 0.0, 0.31, 0.0]
Document 3: [0.0, 0.0, 0.46, 0.0, 0.46]
Document 4: [0.0, 0.26, 0.0, 0.0, 0.0]