Large Language Models: An ‘intuitive’ introduction

Background

  • Rise of machine learning and deep learning in the 2000s
  • Idea: Use neural networks for NLP tasks
  • Challenge: How do we feed text into a neural network?

Answer: Text embeddings!

Example from Bag of Words

flowchart LR
  A(A cat does cat things) --> B{" "}
  B --> C(A)
  B --> D(cat)
  B --> E(does)
  B --> F(cat)
  B --> G(things)
  D --> H(cat)
  E --> I(do)
  F --> J(cat)
  G --> K(thing)

  H --> L(cat: 2)
  J --> L
  I --> M(do: 1)
  K --> N(thing: 1)

  L --> O(2)
  M --> P(1)
  N --> Q(1)
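
A minimal Python sketch of this bag-of-words pipeline (the tiny lemma map and stop-word list below are made-up stand-ins for a real lemmatizer):

from collections import Counter

# Hypothetical toy lemma map: inflected form -> base form.
LEMMAS = {"does": "do", "things": "thing"}
STOP_WORDS = {"a"}

def bag_of_words(text):
    # Lower-case, split on whitespace, drop stop words, lemmatize, count.
    tokens = [t.lower() for t in text.split()]
    tokens = [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

counts = bag_of_words("A cat does cat things")
print(counts)                                        # Counter({'cat': 2, 'do': 1, 'thing': 1})
print([counts[w] for w in ("cat", "do", "thing")])   # the embedding vector [2, 1, 1]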

What could a neural network look like?

flowchart LR
  A(Input Text) --> B(Tokenization)
  B --> C(Token processing)
  C --> D(Embedding Layer)
  D --> E(Hidden Layers)
  E --> F(Output Layer)

  • The hidden layers and output layers depend on the application
  • The rest of the layers can be pre-trained (more on this later)

Example: Text classification

flowchart LR
  A(Input Text) --> B(Tokenization)
  B --> C(Token processing)
  C --> D(Embedding Layer)
  D --> E(Hidden Layers)
  E --> F(Output Layer)

  • Classifying news articles into categories (sports, politics, …)
  • Training data: Dataset with corresponding category or label
  • Data processing: tokenization, stop-word removal, lower-casing, etc.
  • Training:
    • Measure the difference between predicted and true labels and adjust network weights
    • Example: with BoW embeddings, the frequency of each word shifts the predicted label probabilities (see the sketch below).
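
A minimal sketch of such a classifier, assuming scikit-learn and a made-up two-article dataset (a real setup would use far more data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy training data: texts and their category labels.
texts = [
    "the team won the match after a late goal",
    "parliament passed the new budget law today",
]
labels = ["sports", "politics"]

# BoW embedding (word counts) followed by a linear classifier.
model = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(),
)
model.fit(texts, labels)

print(model.predict(["the coach praised the team"]))  # most likely: ['sports']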

Sequence Generation and Language Modeling

Sequence generation

  • Idea: Models are trained to generate sequences of data (mostly: text) based on input/context.
  • Sequences have to resemble the training data.
  • Application: Text generation, music composition, image captioning
  • Requires understanding language structure for meaningful output!

Language modeling

 

Idea: Train a model to predict the probability distribution of words or tokens in a sequence given the preceding context!

 

flowchart LR
  A(The) --> B(bird)
  B --> C(flew)
  C --> D(over)
  D --> E(the)
  E --> F{?}
  F --> G("p(rooftops)=0.31")
  F --> H("p(trees)=0.14")
  F --> J("p(guitar)=0.001")
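
To make this concrete, here is a minimal sketch that estimates such a distribution from raw counts in a made-up three-sentence corpus; it only conditions on the single preceding token (a bigram model), whereas a real language model learns to condition on the whole context.

from collections import Counter, defaultdict

corpus = [
    "the bird flew over the rooftops",
    "the bird flew over the trees",
    "the bird flew over the rooftops again",
]

# Count which token follows which token.
following = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1

def next_token_distribution(token):
    counts = following[token]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

print(next_token_distribution("the"))   # {'bird': 0.5, 'rooftops': 0.33..., 'trees': 0.16...}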

Training Process

  • Expose the model to large text datasets (great: we have the internet!)
  • Teach the model statistical properties of language (Which token comes next?)
  • Capture syntactic structures, semantic relationships, and contextual nuances
  • Training happens in an unsupervised fashion (we require no labels!)
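
Concretely, "teach the model which token comes next" means maximizing the probability the model assigns to each actual next token in the training text, i.e. minimizing the standard cross-entropy (negative log-likelihood) loss:

\[ \mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_1, \dots, x_{t-1}) \]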

Challenges

  • Handling the vast and diverse nature of human language.
  • Complex patterns, variations, and ambiguities.
  • Out-of-vocabulary words, long-range dependencies, domain-specific knowledge.
  • Requires robust architectures and sophisticated algorithms.

BUT: They did it and it works!

GPT: Generative Pre-trained Transformer

What is GPT?

  • the architecture behind today’s state-of-the-art language models
  • GPT belongs to the family of transformer-based models
  • the underlying transformer architecture was introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017 (Google); GPT itself was introduced by OpenAI in 2018
  • key advantages over previous approaches:
    • self-attention
    • scalability

What is a transformer?

  • Traditional approach:
    • information flow constrained by fixed-length context windows or recurrent connections
    • One token at a time (in RNNs)
  • New approach:
    • Each word in a sentence can attend to all other words simultaneously (self-attention)
    • Dynamically weigh the importance of each word in the context of the entire sequence
    • Semantically related words receive higher attention weights
    • Irrelevant or less informative words receive lower weights
    • Processing all sequences in parallel

A deeper dive: What an LLM wants to do!

flowchart LR
  A(Token 1)
  B(Token 2) 
  C(Token 3)
  D(...)
  E(Token k)  

  F(The core of the LLM)

  A --> F
  B --> F
  C --> F
  D --> F
  E --> F

  AA(Prob. dist. 1st output token)
  BB(Prob. dist. 2nd output token)
  CC(...)
  DD(Prob. dist. nth output token)

  F --> AA
  F --> BB
  F --> CC
  F --> DD



How can we generate a probability distribution?

Idea: Transform a vector \(z\) of numbers into a probability distribution!

\[ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \]

Softmax input z: [2.0, 1.0, 0.1, -1.0, 2.0]
Softmax output: [0.39, 0.14, 0.06, 0.02, 0.39]
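
A minimal sketch reproducing these numbers in plain Python (in practice one subtracts \(\max(z)\) before exponentiating for numerical stability):

import math

def softmax(z):
    exps = [math.exp(x) for x in z]   # exponentiate each entry
    total = sum(exps)
    return [e / total for e in exps]  # normalize so the entries sum to 1

z = [2.0, 1.0, 0.1, -1.0, 2.0]
print([round(p, 2) for p in softmax(z)])   # [0.39, 0.14, 0.06, 0.02, 0.39]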

 

Great, so let’s try to generate vectors with our model!

A deeper dive: What an LLM looks like!

 

flowchart LR
  A(Input Text) --> B(Tokenization) --> C(Token Embedding) -.-> X:::hidden 
  
  Y:::hidden -.-> D(Attention) --> E(Multilayer Perceptron) --> F(Attention) --> G(Multilayer Perceptron) --> H(...) -.-> Z:::hidden
  
  XX:::hidden -.-> I(Unembedding) --> J(Probabilities) --> K(Token Output)

 

General idea: Transform input vectors (embeddings) over and over such that the result encodes all the context!

Token embeddings

input_text = "A blue guitar is"

 

Tokenization:

flowchart LR
  A(A) --> B(blue) --> C(guitar) --> D(is) --> E(?)

 

Embedding:

\[W_E = \begin{bmatrix} 0.25 & -0.73 & 0.58 & 0.44 \\ -0.53 & 0.57 & -0.61 & 0.70 \\ 0.11 & 0.33 & 0.22 & -0.49 \\ -0.80 & -0.08 & -0.73 & 0.29 \\ \end{bmatrix} \]

Token embeddings

  • One embedding \(E_k\) per token (a lookup table whose entries are learned during training!)
  • No connection (“context”) between token embeddings at first
  • But: embeddings contain directional information that encodes “meaning”
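
A minimal sketch of this lookup, assuming a toy four-token vocabulary and reusing the illustrative matrix \(W_E\) from above (which row belongs to which token, and the values themselves, are made up; in a real model \(W_E\) has one much longer row per vocabulary token and is learned):

import numpy as np

# Toy vocabulary: token -> row index into the embedding matrix.
vocab = {"a": 0, "blue": 1, "guitar": 2, "is": 3}

# Illustrative embedding matrix W_E: one 4-dimensional embedding per row.
W_E = np.array([
    [ 0.25, -0.73,  0.58,  0.44],
    [-0.53,  0.57, -0.61,  0.70],
    [ 0.11,  0.33,  0.22, -0.49],
    [-0.80, -0.08, -0.73,  0.29],
])

tokens = "A blue guitar is".lower().split()
E = W_E[[vocab[t] for t in tokens]]   # one embedding E_k per input token
print(E[2])                           # the embedding of "guitar"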

Embedding visualization

Unembedding, probabilities and token output

What happens at the end of the LLM?

 

flowchart LR
  A(A) --> AA("$$E_1$$")
  B(blue) --> BB("$$E_2$$")
  C(guitar) --> CC("$$E_3$$")
  D(is) --> DD("$$E_4$$") 

  E(The core of the LLM)

  AA --> E
  BB --> E
  CC --> E
  DD --> E



  F(Unembedding matrix)

  E --> F

  AAA("$$U_1$$")
  BBB("$$U_2$$")
  CCC("$$U_3$$")

  F --> AAA
  F --> BBB
  F --> CCC

  G(Softmax) --> GG(Prob. dist. 1st output token)
  H(Softmax) --> HH(Prob. dist. 2nd output token)
  I(Softmax) --> II(Prob. dist. 3rd output token)
  
  AAA --> G
  BBB --> H
  CCC --> I

  GG --> X("p(great)=0.31")
  GG --> Y("p(awful)=0.14")
  GG --> Z("p(train)=0.001")
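
A minimal sketch of this final step with made-up numbers: the unembedding matrix maps a context-aware embedding to one score (logit) per vocabulary token, and softmax turns those scores into probabilities.

import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract the max for numerical stability
    exps = np.exp(z)
    return exps / exps.sum()

vocab = ["great", "awful", "train"]  # tiny made-up vocabulary

# Made-up final embedding of the last token after the core of the LLM.
final_embedding = np.array([0.9, -0.3, 0.4, 0.1])

# Made-up unembedding matrix: one row of weights per vocabulary token.
W_U = np.array([
    [ 1.2, -0.4,  0.3,  0.8],
    [-0.9,  0.2,  0.5, -0.1],
    [ 0.1,  0.7, -0.6,  0.0],
])

probs = softmax(W_U @ final_embedding)   # one probability per vocabulary token
for token, p in zip(vocab, probs):
    print(f"p({token}) = {p:.3f}")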

Understanding the core: Attention

  • Problem: Embedding vectors so far do not interact with each other, so we cannot process “context”.
  • Solution: Let embedding vectors \(E_k\) share information!

flowchart LR 
  A("$$E_1^{(1)}$$")
  B("$$E_2^{(1)}$$")
  C("$$E_3^{(1)}$$")
  D("$$E_k^{(1)}$$")

  E("Transformation ('Attention')")

  A --> E
  B --> E
  C --> E
  D --> E

  F("$$E_1^{(2)}$$")
  G("$$E_2^{(2)}$$")
  H("$$E_3^{(2)}$$")
  I("$$E_k^{(2)}$$")

  E --> F
  E --> G
  E --> H
  E --> I

Understanding the core: Attention

\[ \text{Attention}(Q, K, V) = \text{softmax}(K^T Q)V \]

  • \(Q\) is the “query”, asking a question
  • \(K\) is the “key”, answering that question
  • \(\text{softmax}\) turns the result into a weight (“How important is that question and its answer?”)
  • Then we multiply by the value matrix \(V\), which transforms the initial embedding vectors so that they now encode the shared information!

Understanding the core: Attention

\[ \text{Attention}(Q, K, V) = \text{softmax}(K^T Q)V \]

\[K_i = E_i W_K\] \[Q_i = E_i W_Q\]

\[ \text{softmax}( \begin{bmatrix} K_1 \cdot Q_1 & K_1 \cdot Q_2 & K_1 \cdot Q_3 \\ K_2 \cdot Q_1 & K_2 \cdot Q_2 & K_2 \cdot Q_3 \\ K_3 \cdot Q_1 & K_3 \cdot Q_2 & K_3 \cdot Q_3 \end{bmatrix}) V \]
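
A minimal NumPy sketch of one such attention step, written in the row-vector convention \(K_i = E_i W_K\) used above and with made-up random weights; note that the original paper additionally divides the scores by \(\sqrt{d}\) before the softmax, and decoder-style models also mask out attention to later tokens (both omitted here for simplicity):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    exps = np.exp(z)
    return exps / exps.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d = 4, 4

E   = rng.normal(size=(n_tokens, d))   # current embeddings, one E_i per row
W_Q = rng.normal(size=(d, d))          # learned in a real model
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))

Q, K, V = E @ W_Q, E @ W_K, E @ W_V    # Q_i = E_i W_Q, K_i = E_i W_K, ...

scores  = Q @ K.T                      # scores[i, j] = Q_i . K_j
weights = softmax(scores, axis=1)      # each row is a probability distribution
E_new   = weights @ V                  # updated embeddings that now share context
print(E_new.shape)                     # (4, 4)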

Multilayer Perceptron

  • A “standard” neural network that works the same way other neural networks do.
  • Can be thought of as “more parameters to tune in the network”!
  • Won’t go into details here.

Put it all together

 

flowchart LR
  A(A) --> AA("$$E_1$$")
  B(blue) --> BB("$$E_2$$")
  C(guitar) --> CC("$$E_3$$")
  D(is) --> DD("$$E_4$$") 

  E(Attention and MP)

  AA --> E
  BB --> E
  CC --> E
  DD --> E

  REP(Repeat many times)

  F(Unembedding matrix)

  E --> REP
  REP --> F

  AAA("$$U_1$$")
  BBB("$$U_2$$")
  CCC("$$U_3$$")

  F --> AAA
  F --> BBB
  F --> CCC

  G(Softmax) --> GG(Prob. dist. 1st output token)
  H(Softmax) --> HH(Prob. dist. 2nd output token)
  I(Softmax) --> II(Prob. dist. 3rd output token)
  
  AAA --> G
  BBB --> H
  CCC --> I

  GG --> X("p(great)=0.31")
  GG --> Y("p(awful)=0.14")
  GG --> Z("p(train)=0.001")

We have created a machine that can generate the most likely next token!
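
Putting that machine to use is then just a loop: repeatedly ask it for the next-token distribution and append a token. A minimal greedy-decoding sketch, where next_token_distribution is a hypothetical stand-in for the whole pipeline above:

def generate(prompt_tokens, next_token_distribution, max_new_tokens=10):
    # next_token_distribution(tokens) -> {token: probability}: a hypothetical stand-in
    # for tokenization -> embeddings -> attention/MLP blocks -> unembedding -> softmax.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        tokens.append(max(dist, key=dist.get))   # greedy: always take the most likely token
    return tokens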

There are some details that need consideration

  • Details of the Multilayer Perceptron layers
  • Training of an LLM involves a second step (reinforcement learning from human feedback)
  • Sampling, temperature scaling, top-\(p\) (How can an LLM be creative? See the sketch after this list.)
  • How does the memory of an LLM work?
  • Many, many details we skipped!
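
As a small taste of one skipped detail, a minimal sketch of temperature scaling: dividing the logits by a temperature \(T\) before the softmax flattens the distribution for \(T > 1\) (more “creative” sampling) and sharpens it towards the most likely token for \(T < 1\).

import math, random

def sample_with_temperature(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]   # temperature scaling
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]     # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, temperature=0.5))   # almost always index 0
print(sample_with_temperature(logits, temperature=2.0))   # noticeably more varied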

But for now, let’s get started using one!