Statistical text analysis

So far we have mainly looked at the analysis of single words/tokens or n-grams. But what about the analysis of a full text? There are many approaches to this, but a good way to get into the topic is a simple statistical analysis of a text. For starters, let’s simply count how often each word appears in a text, a count known as term frequency.

from nltk.tokenize import wordpunct_tokenize
from string import punctuation
from collections import Counter
from typing import List

from nltk.corpus import stopwords
# python -m nltk.downloader stopwords -> run this in your console once to get the stopwords


# load a text from file
text = ""
with open("../assets/chapter1.txt", "r") as file:
    for line in file:
        text += line.strip() + " "  # add a space so words are not glued together across line breaks


def preprocess_text(text: str) -> List[str]:
    # tokenize text
    tokens = wordpunct_tokenize(text.lower())

    # remove single-character punctuation tokens
    # (multi-character tokens such as "--" are kept, as the output below shows)
    tokens = [t for t in tokens if t not in punctuation]

    # remove English stopwords
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]

    return tokens

# count the most frequent words
tokens = preprocess_text(text=text)

for word, count in Counter(tokens).most_common(15):
    print(f"{word}: {count}")
one: 35
winston: 32
face: 28
even: 24
--: 24
big: 22
could: 19
party: 18
would: 18
moment: 18
like: 17
brother: 15
goldstein: 15
telescreen: 14
seemed: 14

Just from the most frequent words, can you guess the text?

In many cases, the sheer number of times a token appears in a text already says a lot about its importance. Counting term frequencies across multiple documents in order to build a fixed vocabulary leads us to the representation known as bag of words.

Bag of Words

If we do the same with multiple texts, we can build up a vocabulary of words and compare the texts to each other based on which terms they contain.

from collections import Counter


def create_bag_of_words(texts):
    # Count the frequency of each word in the corpus
    word_counts = Counter()
    
    for text in texts:
        # Preprocess the text
        words = preprocess_text(text)
        
        # Update word counts
        word_counts.update(words)
    
    # Create the vocabulary by sorting the words alphabetically
    vocabulary = [word for word, _ in sorted(word_counts.items())]

    # Create BoW vectors for each document
    bow_vectors = []
    for text in texts:
        # Preprocess the text
        words = preprocess_text(text)

        # Count word frequencies (a Counter returns 0 for words that do not occur)
        bow_vector = Counter(words)

        # Arrange the counts in vocabulary order
        sorted_bow_vector = [bow_vector[word] for word in vocabulary]

        # Append the BoW vector to the list
        bow_vectors.append(sorted_bow_vector)
    
    return vocabulary, bow_vectors

# Example texts
texts = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Create Bag of Words
vocabulary, bow_vectors = create_bag_of_words(texts)

# Print vocabulary
print("Vocabulary:")
print(vocabulary)

# Print BoW vectors
print("\nBag of Words Vectors:")
for i, bow_vector in enumerate(bow_vectors):
    print(f"Document {i + 1}: {bow_vector}")
Vocabulary:
['document', 'first', 'one', 'second', 'third']

Bag of Words Vectors:
Document 1: [1, 1, 0, 0, 0]
Document 2: [2, 0, 0, 1, 0]
Document 3: [0, 0, 1, 0, 1]
Document 4: [1, 1, 0, 0, 0]

Bag of words gives us a vector representation of our texts with respect to the given vocabulary. We can even do arithmetic with these vectors and try to determine the similarity between the texts.
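
A common choice for this is the cosine similarity: the dot product of two vectors divided by the product of their lengths, cos(v1, v2) = (v1 · v2) / (‖v1‖ ‖v2‖). For non-negative count vectors it ranges from 0 (no terms in common) to 1 (the same relative term frequencies).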

import numpy as np

def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


query = bow_vectors[3]

similarities = []
for i, bv in enumerate(bow_vectors):

    similarity = cosine_similarity(
            vec1=query, 
            vec2=bv
        )

    similarities.append(
        (texts[i], round(similarity, 2))
    )

similarities
[('This is the first document.', 1.0),
 ('This document is the second document.', 0.63),
 ('And this is the third one.', 0.0),
 ('Is this the first document?', 1.0)]

Limitations of Term Frequency and Bag of Words

Purely statistical text analysis methods like Term Frequency (TF), its extension Term Frequency-Inverse Document Frequency (TF-IDF), or Bag of Words are usually a convenient starting point for text analysis. However, while they offer useful insights into textual data, they are not without their limitations.
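
As a side note, TF-IDF is not shown in the code above: it re-weights each term count by how rare the term is across all documents, so that very common words contribute less. A minimal sketch using scikit-learn's TfidfVectorizer (which applies its own tokenization and default weighting, reusing the texts list from earlier):

from sklearn.feature_extraction.text import TfidfVectorizer

# build TF-IDF weighted vectors instead of raw counts
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(texts)

print(tfidf.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))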

One significant drawback is the sheer size of the vocabulary they handle. As the corpus grows, so does the vocabulary, which can become overwhelmingly large, leading to computational inefficiencies and increased memory requirements.

Moreover, these methods struggle with out-of-vocabulary words. Since the vocabulary is fixed once training has finished (that is, once the bag of words has been built from the available documents), words not present in the vocabulary are either ignored, leading to information loss, or handled arbitrarily, potentially skewing the analysis results. This limitation becomes particularly pronounced in domains with specialized jargon or evolving lexicons.
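
To make this concrete, here is a small sketch that reuses preprocess_text and the vocabulary built above on a made-up new document: whatever is not in the fixed vocabulary simply drops out of the resulting vector.

# a new document containing words that were never seen when the vocabulary was built
new_text = "This is the fourth document about mangoes."
words = preprocess_text(new_text)
print(words)

# vectorize against the fixed vocabulary: 'fourth' and 'mangoes' are silently dropped
new_counts = Counter(words)
new_vector = [new_counts[word] for word in vocabulary]
print(new_vector)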

Another critical limitation is the lack of context in these approaches. By treating each word independently and ignoring their sequential and syntactical relationships, TF and Bag of Words fail to capture the nuanced meanings embedded in language. This deficiency hampers their ability to comprehend subtleties such as sarcasm, irony, or metaphors, limiting their applicability in tasks requiring deeper semantic understanding.
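
A quick illustration of the missing word order: two made-up sentences with very different meanings produce exactly the same word counts, and therefore the same bag-of-words vector.

from collections import Counter

sentence_a = "the dog chased the cat"
sentence_b = "the cat chased the dog"

# identical multisets of words, hence identical BoW vectors, despite opposite meanings
print(Counter(sentence_a.split()) == Counter(sentence_b.split()))  # prints True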

Last but not least, these methods lack structural awareness. They disregard the hierarchical and syntactic structures inherent in language, missing out on essential cues provided by sentence and paragraph boundaries. Thus they often struggle to differentiate between sentences with similar word distributions but differing in meaning or intent.

As a consequence, more sophisticated methods are required to really start understanding human language. But before we get to such methods, let’s try one more thing.

Clustering of Bag of Word vectors

As we’ve seen above, the idea of BoW already gives us a rather simple way to compare texts to each other. Whenever we can compare different entities to each other (here: the texts), a pretty straightforward extension is to try and find clusters, that is, groups of similar texts.

Clustering using bag-of-word vectors is a common technique in text analysis for grouping similar documents together based on the similarity of their word distributions. As seen above, each document is represented as a high-dimensional vector, with each dimension corresponding to a unique word in the vocabulary and its value reflecting the frequency of that word in the document. By treating documents as points in a high-dimensional space, clustering algorithms such as K-means or hierarchical clustering can be applied to partition the documents into coherent groups. The similarity between documents is typically measured using distance metrics such as cosine similarity or Euclidean distance, which quantify the degree of overlap between their word distributions.

One advantage of using bag-of-word vectors for clustering is its simplicity and scalability. Since the vectors only capture the frequency of words without considering their order or context, the computational complexity remains manageable even for large datasets with extensive vocabularies. However, clustering based on bag-of-word vectors also has its limitations. One major drawback is the reliance on word frequency alone, which may overlook important semantic similarities between documents. Additionally, the curse of dimensionality can become a challenge as the size of the vocabulary increases, leading to decreased clustering performance and increased computational overhead. Despite these limitations, clustering using bag-of-word vectors serves as a foundational approach in text analysis, providing valuable insights into document similarity and aiding tasks such as document organization, topic modeling, and information retrieval.

Let’s finish up with a small code example.

from sklearn.cluster import KMeans

# do K-means clustering
n_clusters = 2  # specify the number of clusters
kmeans = KMeans(n_clusters=n_clusters, n_init="auto")
cluster_labels = kmeans.fit_predict(bow_vectors)

print("Cluster labels:\n")
for i, label in enumerate(cluster_labels):
    print(f"Document {i + 1} belongs to Cluster {label + 1}")
Cluster labels:

Document 1 belongs to Cluster 1
Document 2 belongs to Cluster 1
Document 3 belongs to Cluster 2
Document 4 belongs to Cluster 1

Let’s do this with some “texts” that are more likely to actually form clusters. And, of course, there are packages that can build the Bag of Words for us. Here we use scikit-learn.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import numpy as np

# Example texts representing different topics
texts = [
    "apple orange banana",
    "apple orange mango",
    "banana apple kiwi",
    "strawberry raspberry blueberry",
    "strawberry raspberry blackberry"
]

# create Bag of Words using CountVectorizer
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(texts)

print("Bag of Word vectors:")
print(bow_matrix.toarray())

# perform K-means clustering
num_clusters = 2
kmeans = KMeans(n_clusters=num_clusters, n_init="auto")
cluster_labels = kmeans.fit_predict(bow_matrix)

print("\nCluster labels:")
for i, label in enumerate(cluster_labels):
    print(f"Document {i + 1} belongs to Cluster {label + 1}")
Bag of Word vectors:
[[1 1 0 0 0 0 1 0 0]
 [1 0 0 0 0 1 1 0 0]
 [1 1 0 0 1 0 0 0 0]
 [0 0 0 1 0 0 0 1 1]
 [0 0 1 0 0 0 0 1 1]]

Cluster labels:
Document 1 belongs to Cluster 2
Document 2 belongs to Cluster 2
Document 3 belongs to Cluster 2
Document 4 belongs to Cluster 1
Document 5 belongs to Cluster 1

# Get the vocabulary
vocabulary = vectorizer.get_feature_names_out()

# print the vocabulary
print("Vocabulary:")
print(vocabulary)

# print the Bag of Words matrix with corresponding words
print("\nBag of Words matrix with corresponding words:")
bow_matrix_array = bow_matrix.toarray()
for i, document_vector in enumerate(bow_matrix_array):
    words_in_document = [(word, frequency) for word, frequency in zip(vocabulary, document_vector) if frequency > 0]
    print(f"Document {i + 1}: {words_in_document}")
Vocabulary:
['apple' 'banana' 'blackberry' 'blueberry' 'kiwi' 'mango' 'orange'
 'raspberry' 'strawberry']

Bag of Words matrix with corresponding words:
Document 1: [('apple', 1), ('banana', 1), ('orange', 1)]
Document 2: [('apple', 1), ('mango', 1), ('orange', 1)]
Document 3: [('apple', 1), ('banana', 1), ('kiwi', 1)]
Document 4: [('blueberry', 1), ('raspberry', 1), ('strawberry', 1)]
Document 5: [('blackberry', 1), ('raspberry', 1), ('strawberry', 1)]
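
As mentioned above, hierarchical clustering is another common choice besides K-means. A minimal sketch using scikit-learn's AgglomerativeClustering on the same bow_matrix (with the default Ward linkage, which needs the dense count vectors):

from sklearn.cluster import AgglomerativeClustering

# agglomerative (hierarchical) clustering on the dense Bag of Words matrix
agglo = AgglomerativeClustering(n_clusters=2)
agglo_labels = agglo.fit_predict(bow_matrix.toarray())

print("Cluster labels:")
for i, label in enumerate(agglo_labels):
    print(f"Document {i + 1} belongs to Cluster {label + 1}")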