So far we have mainly looked at the analysis of single words/tokens or n-grams. But what about the analysis of a full text? There are many approaches to this, but a good way to get into the topic is a simple statistical analysis of a text. For starters, let’s simply count how often each word appears in a text, also known as its term frequency.
from nltk.tokenize import wordpunct_tokenize
from string import punctuation
from collections import Counter
from typing import List
from nltk.corpus import stopwords

# python -m nltk.downloader stopwords -> run this in your console once to get the stopwords

# load a text from file
text = ""
with open("../assets/chapter1.txt", "r") as file:
    for line in file:
        # add a space so that words at line breaks are not glued together
        text += line.strip() + " "


def preprocess_text(text: str) -> List[str]:
    # tokenize text
    tokens = wordpunct_tokenize(text.lower())
    # remove punctuation
    tokens = [t for t in tokens if t not in punctuation]
    # remove stopwords
    stop_words = stopwords.words("english")
    tokens = [t for t in tokens if t not in stop_words]
    return tokens


# count the most frequent words
tokens = preprocess_text(text=text)
for t in Counter(tokens).most_common(15):
    print(f"{t[0]}: {t[1]}")
Just from the most frequent words, can you guess the text?
In many cases, simply how often a token appears in a text already indicates its importance. The concept of counting term frequencies across multiple documents in order to create a fixed vocabulary is known as bag of words.
Bag of Words
If we do the same with multiple texts, we can build up a vocabulary of words and compare different texts to each other based on the appearance of terms.
from collections import Counter


def create_bag_of_words(texts):
    # Count the frequency of each word in the corpus
    word_counts = Counter()
    for text in texts:
        # Preprocess the text
        words = preprocess_text(text)
        # Update word counts
        word_counts.update(words)

    # Create the vocabulary by sorting the words alphabetically
    vocabulary = [word for word, _ in sorted(word_counts.items())]

    # Create BoW vectors for each document
    bow_vectors = []
    for text in texts:
        # Preprocess the text
        words = preprocess_text(text)
        # Create a Counter object to count word frequencies
        bow_vector = Counter(words)
        # Fill in missing words with zero counts
        for word in vocabulary:
            if word not in bow_vector:
                bow_vector[word] = 0
        # Sort the BoW vector based on the vocabulary order
        sorted_bow_vector = [bow_vector[word] for word in vocabulary]
        # Append the BoW vector to the list
        bow_vectors.append(sorted_bow_vector)

    return vocabulary, bow_vectors


# Example texts
texts = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Create Bag of Words
vocabulary, bow_vectors = create_bag_of_words(texts)

# Print vocabulary
print("Vocabulary:")
print(vocabulary)

# Print BoW vectors
print("\nBag of Words Vectors:")
for i, bow_vector in enumerate(bow_vectors):
    print(f"Document {i + 1}: {bow_vector}")
Bag of words actually gives us a vector representation of our texts with respect to the given vocabulary. We can even compute with these vectors, for example to determine how similar the texts are to each other.
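The snippet that produces the output below is not shown in the text, but the numbers correspond to the cosine similarity of each document’s BoW vector with that of the first document. Here is a minimal sketch, reusing texts and bow_vectors from above (the helper cosine_sim is an assumption, not part of the original code):

import numpy as np
from pprint import pprint


def cosine_sim(a, b):
    # cosine similarity between two count vectors
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# compare every document to the first document
reference = bow_vectors[0]
pprint([(text, round(cosine_sim(reference, vector), 2))
        for text, vector in zip(texts, bow_vectors)])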
[('This is the first document.', 1.0),
('This document is the second document.', 0.63),
('And this is the third one.', 0.0),
('Is this the first document?', 1.0)]
Limitations of Term Frequency and Bag of Words
Purely statistical text analysis methods like Term Frequency (and its extension, Term Frequency-Inverse Document Frequency) or Bag of Words are usually a convenient starting point. However, while they offer useful insights into textual data, they are not without their limitations.
One significant drawback is the sheer size of the vocabulary they handle. As the corpus grows, so does the vocabulary, which can become overwhelmingly large, leading to computational inefficiencies and increased memory requirements.
Moreover, the mentioned methods struggle with out-of-vocabulary words. Since the vocabulary is fixed once the bag of words has been built from the training documents, words not present in it are either ignored, leading to information loss, or handled arbitrarily, potentially skewing the analysis results. This limitation becomes particularly pronounced in domains with specialized jargon or evolving lexicons.
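To make this concrete, here is a small sketch that reuses preprocess_text and the vocabulary built above (the example sentence and the word "report" are made up for illustration):

# a new document containing a word ("report") that is not in the fixed vocabulary
new_doc = "This is the first report."
counts = Counter(preprocess_text(new_doc))  # -> Counter({"first": 1, "report": 1})

# "report" is silently dropped when projecting onto the fixed vocabulary
new_vector = [counts[word] for word in vocabulary]
print(new_vector)  # only "first" is represented; the unknown word is lost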
Another critical limitation is the lack of context in these approaches. By treating each word independently and ignoring their sequential and syntactical relationships, TF and Bag of Words fail to capture the nuanced meanings embedded in language. This deficiency hampers their ability to comprehend subtleties such as sarcasm, irony, or metaphors, limiting their applicability in tasks requiring deeper semantic understanding.
Last but not least, these methods lack structural awareness. They disregard the hierarchical and syntactic structures inherent in language, missing out on essential cues provided by sentence and paragraph boundaries. Thus, they often struggle to differentiate between sentences that have similar word distributions but differ in meaning or intent.
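A quick illustration of this word-order blindness (the two example sentences are made up): they mean opposite things but end up with identical bag-of-words counts.

# opposite meaning, identical word counts once word order is discarded
sentence_a = "The dog bit the man."
sentence_b = "The man bit the dog."
print(Counter(preprocess_text(sentence_a)) == Counter(preprocess_text(sentence_b)))  # True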
As a consequence, more sophisticated methods are required to really start understanding human language at a deeper level. But before we get to such methods, let’s try one more thing.
Clustering of Bag of Word vectors
As we’ve seen above, the idea of BoW already gives us a rather simple way to compare texts to each other. Whenever we can compare different entities to each other (here: the texts), a pretty straightforward extension is to try and find clusters, that is, groups of similar texts.
Clustering using bag-of-word vectors is a common technique in text analysis for grouping similar documents together based on the similarity of their word distributions. As seen above, each document is represented as a high-dimensional vector, with each dimension corresponding to a unique word in the vocabulary and its value reflecting the frequency of that word in the document. By treating documents as points in a high-dimensional space, clustering algorithms such as K-means or hierarchical clustering can be applied to partition the documents into coherent groups. The similarity between documents is typically measured using distance metrics such as cosine similarity or Euclidean distance, which quantify the degree of overlap between their word distributions.
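As a small illustration of this distance step, one can compute the pairwise cosine similarities of the BoW vectors from the example above. This sketch uses scikit-learn, which is only introduced further below, so consider it a preview rather than part of the original example.

from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine similarities between all BoW vectors from the example above
similarity_matrix = cosine_similarity(bow_vectors)
print(similarity_matrix.round(2))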
One advantage of using bag-of-word vectors for clustering is the simplicity and scalability of the approach. Since the vectors only capture the frequency of words without considering their order or context, the computational complexity remains manageable even for large datasets with extensive vocabularies. However, clustering based on bag-of-word vectors also has its limitations. One major drawback is the reliance on word frequency alone, which may overlook important semantic similarities between documents. Additionally, the curse of dimensionality can become a challenge as the size of the vocabulary increases, leading to decreased clustering performance and increased computational overhead. Despite these limitations, clustering using bag-of-word vectors serves as a foundational approach in text analysis, providing valuable insights into document similarity and aiding tasks such as document organization, topic modeling, and information retrieval.
Let’s finish up with a small code example.
from sklearn.cluster import KMeans

# do K-means clustering
n_clusters = 2  # specify the number of clusters
kmeans = KMeans(n_clusters=n_clusters, n_init="auto")
cluster_labels = kmeans.fit_predict(bow_vectors)

print("Cluster labels:\n")
for i, label in enumerate(cluster_labels):
    print(f"Document {i + 1} belongs to Cluster {label + 1}")
Cluster labels:
Document 1 belongs to Cluster 1
Document 2 belongs to Cluster 1
Document 3 belongs to Cluster 2
Document 4 belongs to Cluster 1
Let’s now do this with some “texts” that are more likely to actually form clusters. And, of course, there are packages that can do the Bag of Words for us; here we use scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import numpy as np

# Example texts representing different topics
texts = [
    "apple orange banana",
    "apple orange mango",
    "banana apple kiwi",
    "strawberry raspberry blueberry",
    "strawberry raspberry blackberry",
]

# create Bag of Words using CountVectorizer
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(texts)

print("Bag of Word vectors:")
print(bow_matrix.toarray())

# perform K-means clustering
num_clusters = 2
kmeans = KMeans(n_clusters=num_clusters, n_init="auto")
cluster_labels = kmeans.fit_predict(bow_matrix)

print("\nCluster labels:")
for i, label in enumerate(cluster_labels):
    print(f"Document {i + 1} belongs to Cluster {label + 1}")
# Get the vocabulary
vocabulary = vectorizer.get_feature_names_out()

# print the vocabulary
print("Vocabulary:")
print(vocabulary)

# print the Bag of Words matrix with corresponding words
print("\nBag of Words matrix with corresponding words:")
bow_matrix_array = bow_matrix.toarray()
for i, document_vector in enumerate(bow_matrix_array):
    words_in_document = [
        (word, frequency)
        for word, frequency in zip(vocabulary, document_vector)
        if frequency > 0
    ]
    print(f"Document {i + 1}: {words_in_document}")
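As an optional add-on (not part of the original example), the fitted cluster centers can be used to see which words characterize each cluster, reusing kmeans and vocabulary from the code above:

# top words per cluster, read off from the K-means cluster centers
for cluster_idx, center in enumerate(kmeans.cluster_centers_):
    top_indices = center.argsort()[::-1][:3]
    top_words = [vocabulary[i] for i in top_indices]
    print(f"Cluster {cluster_idx + 1}: {top_words}")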