How Deep is Your L(ove)LM? What Makes a Neural Network ‘Deep’ When Building GPTs

37 minute read

Published: August 18, 2025

Introduction

What’s this? A Statistician caught off his guard? Who would have thought that I would now be again ascending up the gradient to the world’s current global maxima: large language models. Unfortunately for LLMs, however, I have an incessant need to understand why and how things work, and so rather than flaunting the at-times impressive capabilities of LLMs or AI more generally, this blog instead deconstructs them to their foundations so we can explore what is truly behind that shiny mask. This blog follows on from my previous post on how the ‘attention’ mechanism works in GPTs, but comes with a much larger scope: to explore what makes a neural network ‘deep’ and what the resulting impact of this is on the training process for a LLM and the methods used to scale this requirement of depth to avoid serious pitfalls.

En garde!

Code replication

As always with neural networks and LLMs specifically, we are going to be using PyTorch to build our architecture because it is the gold standard for this sort of thing. Broadly, this post aims to replicate the general code approach to construct a bigram large language model presented in this incredible video but with a few key changes/differences in motivation:

We are going to train our model on the script for The Lord of the Rings: Fellowship of the Ring instead of Tiny Shakespeare because I love LOTR so much it defies description
We are going to experiment with key hyperparameters to explore the resulting impact on training loss, validation loss, and generated text output

NOTE: The code to run the training loops for these neural networks takes a VERY long time if you do not have a CUDA-supported GPU and therefore need to run it on a CPU. Please be warned.

We can start by loading the libraries we will need:

import numpy as np
import requests # Just to download .txt data from the internet
import torch
import torch.nn as nn
from torch.nn import functional as F

What is a bigram language model?

Just as a quick introduction to position what our overall model is designed to do, I’ll briefly describe what a bigram model entails and what some alternative model approaches are. Basically, a bigram language model is the simplest language model: it predicts the probability of the next character (or word in a word-based architecture) given the preceding character or word. This approach is the simplest case of a more general class of N-gram models which are sequences of characters or words that can estimate the probability of a word given the previous words, and therefore assign probability distributions to entire sequences.

It is important to state that N-gram models have been vastly superseded by the approaches used in tools such as ChatGPT, Gemini, and Claude. However, training models of that complexity is far beyond the scope of a humble blog post (and my computer’s GPU), so we will stick with the bigram model for our purposes.

Full model code

Before we get on to our experiments, let’s define the entire language model in code. Thankfully, we can break up the code into discretised modules to make it a bit easier for you to follow along. Here I’ll decompose it into the following:

Data
Hyperparameters
Batching
Self-attention
Model
Loss estimation
Training loop

Each is presented and explained in turn. Fair warning for readers that are not programmers: this is a decent amount of code (\(\approx 230\) total lines of Python), but never fear! I will explain it all as we go.

1. Data

We start with the simplest part: the data. First we download the Fellowship of the Ring movie script, find all the unique characters to form our vocabulary, and define some simple encoding and decoding methods to convert the unique characters to integers (for the purpose of machine learning) and the integers back to characters to reconstruct the text.

# Download the data and convert to text

text = requests.get("https://raw.githubusercontent.com/eDubrovsky/movie_scripts/refs/heads/master/Lord-of-the-Rings-Fellowship-of-the-Ring%2C-The.txt")
text = text.text

# Create vocabulary using unique characters

chars = sorted(list(set(text))) # Unique characters our character-level model can generate
vocab_size = len(chars) # Number of potential characters (i.e., our output will be a probability distribution over this number of characters)
print(''.join(chars)) # Concatenate strings together

## 
##  !"&'(),-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~

print(vocab_size)

## 80


# Create mapping from int to str and str to int

stoi = {ch:i for i,ch in enumerate(chars)} # Map characters to integers like a lookup table
itos = {i:ch for i,ch in enumerate(chars)} # Mapp integers to characters (i.e., backtransform) like a lookup table
encode = lambda s: [stoi[c] for c in s] # Take a string and output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # Take a list of integers and output a string

Let’s now take a peek at what the script looks like to glimpse its broad structure in text form:

print(text[1:990])

## BLACK SCREEN
## 
## SUPER: New Line Cinema Presents
## 
## SUPER: A Wingnut Films Production
## 
## BLACK CONTINUES... ELVISH SINGING....A WOMAN'S VOICE IS
## whispering, tinged with SADNESS and REGRET:
## 
##                     GALADRIEL (V.O.)
##               (Elvish: subtitled)
##           "I amar prestar sen: han mathon ne nen,
##           han mathon ne chae...a han noston ned
##           wilith."
##               (English:)
##           The world is changed: I feel it in the
##           water, I feel it in the earth, I smell it
##           in the air...Much that once was is lost,
##           for none now live who remember it.
## 
## SUPER: THE LORD OF THE RINGS
## 
## EXT. PROLOGUE -- DAY
## 
## IMAGE: FLICKERING FIRELIGHT. The NOLDORIN FORGE in EREGION.
## MOLTEN GOLD POURS from the lip of an IRON LADLE.
## 
##                     GALADRIEL (V.O.)
##           It began with the forging of the Great
##           Rings.
## 
## IMAGE: THREE RINGS, each set with a single GEM, are received
## by the HIGH ELVES-GALADRIEL, GIL-GALAD and CIRDAN.
## 
##

Beautiful.

As a final data preparation step, we also need to encode our data as a tensor) so PyTorch knows how to handle it. We will also produce a basic train-validation split while we are at it. Since our model is a super-basic bigram that generates future characters sequentially¹ we don’t need to randomly split the data in the tensor—we can just cut it at some \(\%\) through the data since we want to generate future values. here we’ll choose \(90\%\).

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

2. Hyperparameters

We now move onto hyperparameters which govern the structure of our model and also how it learns. We are going to define several here:

batch_size—number of independent sequences processed in parallel. Batching is important as feeding in the entire dataset at once is very computationally intensive and a large burden on memory.
block_size—maximum context length for predictions (i.e., total maximum number of characters the model can see at any time). This means the model will see anywhere between 1 and block_size number of characters.
max_iters—number of iterations (i.e., epochs) to train the model for.
eval_interval—every eval_interval number of iterations up until max_iters we evaluate on validation loss instead of training loss to inject out-of-sample variability and learning into the model.
learning_rate—how big the step change is when the model updates parameters in response to errors. Smaller learning rates usually work better for deeper neural networks.
device—tells PyTorch whether to use a CUD-enabled GPU if possible, otehrwise use the computer’s CPU.
eval_iters—number of evaluation iterations to compute loss for when model is in evaluation mode.
n_embd—number of embedding dimensions.
n_head—number of self-attention heads to use.
n_layer—how many layers of blocks we want to implement. This is a key driver of what a neural network ‘deep’ and has immediate implications for training time.
dropout—proportion of weights to randomly set to zero to control overfitting. Introduced in this excellent paper. The intuition is to ‘sever’ \((\text{dropout}\times 100)\%\) worth of connections between tokens.

Let’s start relatively small, for now:

batch_size = 32 # Number of independent sequences processed in parallel
block_size = 256 # Maximum context length for predictions
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 192 # Number of embedding dimensions
n_head = 6
n_layer = 2
dropout = 0.2

torch.manual_seed(123) # For reproducibility

## <torch._C.Generator object at 0x128408890>

These numbers are helpful for people familiar with neural network architecture, but for others, how can we know whether these parameters represent a ‘deep’ neural network or not? Let’s estimate the total number of parameters in this network.

First, we have \(\text{vocab_size} \times \text{n_embd}\) token embedding parameters for a total of \(15,360\). We then have an additional \(\text{block_size} \times \text{n_embd}\) position embedding parameters for a total of \(49,152\). Spoiler alert for some of the subsections below, but each self-attention head has \(3 \times \text{n_embd} \times \text{head_size}\) (where \(\text{head_size}\) is calculated by \(\frac{\text{n_embd}}{\text{head_size}}\) to evenly distribute the size) parameters (one \(\text{n_embd} \times \text{head_size}\) for each of the \(Q\), \(K\), and \(V\) tensors). Since we have \(6\) heads, this is a total of \(110,592\). We then need to add a projection layer on top, which has an additional \(\text{n_embd}\) weights and the same number of bias terms (since the \(\text{head_size} \times \text{n_head}\) parameters that connect to \(\text{n_embd}\) outputs are already accounted for in the multi-head calculation before). This total for the attention mechanism is therefore \(110,976\). We then add the parameters for the feedforward process, which contains \(\text{n_embd} \times (4 \times \text{n_embd}) + \text{n_embd}\) weights and the same number of bias terms, for a total of \(295,296\) parameters.

Putting it all together, one block (or layer) of the neural network has \(15360 + 49152 + 110976 + 295296 = 470,784\) parameters. However, we currently have two layers, so we need to multiply that total by two, for a sum of \(941,568\). Pretty big for a model that I’m training on a personal computer! Hopefully it’s becoming clearer already that the number of layers is a key driver in how ‘big’ or ‘deep’ a neural network goes. We’ll explore this concept empirically later. Let’s move on.

3. Batching

Please see my previous blog post for a more detailed breakdown on what batching is and why it’s important. For the purposes of this post, just know that the code below produces a single batch—meaning chunks of block_size characters are selected at random from either the train set or validation set and converted into a batch_size \(\times\) block_size tensor.

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size, )) # Select random chunks with batch_size offsets
    x = torch.stack([data[i:i+block_size] for i in ix]) # First block_size characters starting at i
    y = torch.stack([data[i+1:i+1+block_size] for i in ix]) # Offset by 1 of x (i.e., just +1 for next character)
    x, y = x.to(device), y.to(device) # Pass to device depending on GPU or CPU
    return x, y # torch.stack produces a row, so this gives us a 4x8 tensor (i.e., batch_size x block_size)

4. Self-attention

This section is pretty long. We are going to decompose it into four subcomponents and step through them one-by-one:

Single head of self-attention
Multi-head attention
Feedforward method
Transformer block

Before we do that, let’s briefly discuss what self-attention is and why it’s important. For a more detailed treatment, please see my previous blog post. In 2017, Google humbly dropped this landmark paper which introduced the neural network architecture known as a transformer and one of its key components: attention. You can think of attention as being a way of encoding how much importance (or attention) each token (i.e., character, in our case) places on the other tokens in a given sequence of length block_size. For example, if a vowel is the most recent character in a sequence, it might care more about finding consonants in its block_size past in order to build a better guess at what the word might be. Attention can be broadly decomposed into two types: (i) self-attention (our focus here); and (ii) cross-attention (out-of-scope here). Please see that Google paper for more on cross-attention and its relationship with encoder-decoder operations.

In sum: without attention, our model will likely just generate nonsense—we need it in order to account for the relative importance of other tokens around the one from which we are generating.

Single head of self-attention

As a first step, we are just going to code up a single ‘head’ (yes, that’s what they are called) of attention. We can build complexity on top of this shortly. The single head is composed of two broad components: (i) model layer phase; and (ii) what a forward pass looks like. The PyTorch code below implements this (see an older blog post of mine for a guide on PyTorch syntactical structure). In general, most of the modules for our neural network in PyTorch consist of defining a class (which usually inherits type nn.Module from PyTorch) and then our layers, followed by a forward pass function (i.e., what a single run through the algorithm looks like), and then any other functions we want our class to do, such as generate new text for some input text from the user (more on that later).

With respect to the layer phase, for self-attention, this procedure instantiates linear connections to produce the beginnings of our attention mechanism, as described in the Google paper:

\[ \text{Attention}(Q, K, V) = \text{softmax}\frac{QK^{T}}{\sqrt{d_{k}}}V \]

namely, the key, query, and value tensors that will become \(K\), \(Q\), and \(V\) in the forward pass later. We are also adding a register buffer to store information that PyTorch does not change, and engaging dropout to protect against overfitting in the forward pass. Speaking of the forward pass, you’ll note we are basically just following the steps outlined in the Google paper (once again) that I discussed in detail in my previous post. First, we extract the \(B\) (batch size), \(T\) (‘time’; i.e., block_size), and \(C\) (‘channels’; i.e., number of embedding dimensions) dimensions from the input data x, then call our key and query operations to produce \(K\) and \(Q\). We get the weights by computing the attention equation above through scaled dot products, compute a softmax on a masked (i.e., lower triangular to prevent early tokens in the sequence from having connections with ones that come after them—this would not be an autoregressive process if lookahead was allowed) version of the tensor to produce weights for each row that sum to \(1\) to govern attention, drop out \(20\%\) of the connections, and finally compute the dot product between our values and weights to get the final numbers we need. That’s a lot of info, but the scaled dot product attention approach is quite intuitive if you read the paper or my last post.

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout) # Prevents overfitting

    def forward(self, x):
        B,T,C = x.shape
        K = self.key(x) # Has shape (B,T,head_size) due to linear mapping above
        Q = self.query(x) # Has shape (B,T,head_size) due to linear mapping above
        wei = Q @ K.transpose(-2,-1) * K.shape[-1]**-0.5 # Produces (B,T,head_size) @ (B,head_size,T) = (B,T,T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei) # Randomly prevents some nodes (i.e., tokens) from communicating
        V = self.value(x)
        out = wei @ V
        return out

Multi-head attention

Above we defined a single head of attention. However, we are in the business of scaling our model, and so we are going to follow Google’s example and extend to a multi-head setting for even more learning flexibility and generative capability. The intuition here is that we are going to run several heads in parallel and concatenate their results. The code to do this is quite simple, and basically consists of us creating a ModuleList to store multiple Head objects, instantiating a linear transformation term and the dropout amount. Our forward method applies the heads and concatenates the results, applies the linear transformation to dimension n_embd, and then applies dropout to the resulting tensor.

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1) # -1 concatenates over the Channel dimension
        out = self.dropout(self.proj(out))
        return out

Feedforward method

Our next step is to define a feedfoward method for a simple linear layer followed by a non-linearity. This helps the neural network learn complex patterns in the data. We are basically following the advice of the Google paper here, especially with respect to some of the settings, such as the \(4 \times \text{n_embd}\) number of neurons that the first linear layer projects to. Note that the basic flow here is to construct a linear mapping of n_embd number of neurons to \(4 \times \text{n_embd}\) number of neurons, perform a non-linear rectified linear unit activation, project back to n_embd number of neurons linearly, and then apply dropout.

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), # 4* multiplier comes from the paper where they went from 512 to 2048 (in their case)
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), # Projection layer going back into residual pathway
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

Transformer block

Here is where some of the super interesting transformer architecture heats up. Our transformer block will essentially define a procedure for communication between tokens (i.e., using multi-head attention) followed by computation (feed forward through the network). First, we calculate the size of each multi-attention head by dividing the number of embedding dimensions by the number of heads. We then initiate the self-attention process by invoking our MultiHeadAttention class defined earlier. Similarly, we initiate the feedforward process as well. The next two lines of code are a new development, however. We are applying layer norms to help optimise the neural network. Research back in 2016 empirically demonstrated that applying normalisation to layers can drastically reduce the training time of deep neural networks. Mathematically, the layer norm is written as:

\[ y = \frac{x - E[x]}{\sqrt{\text{Var}[x] + \epsilon}} \times \gamma + \beta \]

which you’ll note is incredibly similar to the formula for a zscore, and you would be correct in assuming that the layer norm also standardises the data to have zero mean and a standard deviation of \(1\). However, note the extra parameters \(\gamma\) and \(\beta\). The inclusion of these learnable parameters means that the final layer norm result \(y\) may not be unit Gaussian (i.e., mean of zero and standard deviation of \(1\)), but the optimisation process will determine this. Back to our code, we are creating two layer norms—one for the self-attention process to capture the residual connections and another for the feedforward process. We simply sum these together to complete the block.

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x)) # Residual connections
        x = x + self.ffwd(self.ln2(x))
        return x

5. Model

This section is a beast, apologies, but hopefully the code is intuitively annotated. Our model follows the same broad PyTorch structure I introduced earlier: (i) define the layers; (ii) define the forward process; and (iii) define other functions the model class should perform. Let’s start with the layers. First, we are going to create embedding tables for both the tokens using the total vocabulary size and number of embedding dimensions and the positions within blocks and the number embedding dimensions. From here, we can define the sequential procedure that creates each block for each layer (remembering that every Block() call itself calls MultiHeadAttention). We then apply a layer norm for regularisation and define a final linear mapping which connects the number of embedding dimensions as neurons to our vocabulary size as we want to end up with a probability distribution of the next character in the sequence over all possible characters.

We now move on to the forward definition. We start by invoking the token and position embedding tables, taking care to use the correct shape (i.e., dimensions). We then create our key data container x by summing the two embedding tables and then passing it into the blocks and then the linear projection to get logits for each token which we will later convert to probabilities. We then define a conditional section with an argument targets that defaults to None. Basically, if targets is not None, we get the full loss evaluation procedure by computing the cross-entropy loss (or the negative log-likelihood, for the statistically inclined like me!) between our logits and the actual target labels (i.e., correct character tokens). However, if targets=None, then we do not need to evaluate the loss and instead save ourselves the computational effort. This is used for the final function in the model: generate.

The last step in our model is a new function which generate new tokens from the input data. The generate function produces new tokens in the expected sequence from input data (i.e., text generation). While the syntax is pretty specific to PyTorch, the general idea is this:

Calculate logits for each new token up to max_new_tokens (i.e., number of new characters to generate)
Convert the logits to probabilities to get a nice distribution over the vocabulary
Draw one sample from the multionomial distribution governed by the probabilities for each token. We then append the result to our running sequence of tokens and repeat until max_new_tokens is reached. The appending is important because we are need each preceding character to predict the most likely next one in the future

class BigramLanguageGenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx) # (B, T, n_embd) -- i.e., C = n_embd
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # Intuition: self-attention happens then feedforward on the per token level gives them time to 'think' on the connections
        logits = self.lm_head(x) # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape # Need to reshape the tensor as PyTorch's cross_entropy loss expects a B,C,T tensor
            logits = logits.view(B*T, C) # 2-dimensional array (basically by stretching the array but preserving the C dimension)
            targets = targets.view(B*T) # Equivalent to just writing -1
            loss = F.cross_entropy(logits, targets) # Negative log-liklihood loss function for measuring quality of logits with respect to targets
        return logits, loss # NOTE: In our case, B=batch_size, T=block_size, C=vocab_size
    
    # Generation function
    # INTUITION: Take the idx (B,T) array of indices and extend it to become (B,T)+1, (B,T)+2, etc.
    
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:] # Due to conditional embeddings we need to crop context as it only goes up to block_size
            logits, loss = self(idx_cond) # This refers back to the forward function
            logits = logits[:, -1, :] # Focuses on last time step to become (B,C)
            probs = F.softmax(logits, dim=-1) # Softmax gives us probabilities
            idx_next = torch.multinomial(probs, num_samples=1) # Grab just one sample from the probability distribution
            idx = torch.cat((idx, idx_next), dim=1) # Append sampled index to the running sequence for (B, T+1)
        return idx
    
model = BigramLanguageGenModel()
m = model.to(device)

6. Loss estimation

In the above model section of our algorithm we described the computation of loss. In order to track our loss progress during the training loop (which comes next!) we need to define a loop that either calculates the average loss across batches on the training set or the evaluation set. In order to be computationally efficient, we disable backpropagation by specifying the @torch.no_grad() (i.e., ‘no gradient’) context manager. We also set the model to evaluation mode to explicitly tell PyTorch we are not computing. Note that we reset it to training mode at the end. This is one of those interesting cases where PyTorch models ‘accumulate’ information as the classes are called. This makes our code simpler since we do not always need to manually specify and append information, but it also means we need to be careful when trying to optimise computational efficiency.

@torch.no_grad() # Context manager to disable PyTorch's .backward() function to improve memory efficiency
def estimate_loss():
    out = {}
    model.eval() # Set model to evaluation mode
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # Reset model to training mode
    return out

7. Training loop

Final step! We are almost at the point of this blog post which is to experiment with hyperparameters. The last thing we need to do is define a custom training loop for the model. For our purposes here, this is a pretty simple task, but in real applied settings these training loops can get very complex. We first need to define the optimiser that will be used to update our parameters based on gradient calculations. We are going to use the AdamW optimiser which is a variant on the popular and effective Adam optimiser (short for ‘Adaptive Moment Estimation’) where weight decay does not accumulate in the momentum nor variance. We then define a loop over the maximum number of iterations² where we occasionally evaluate on the validation loss instead of the training loss to get a sense of overfitting. The actual neural network part of our loop is pretty straightforward: we first batch our data, then compute logits and loss by calling the model, update the gradient calculations, then get the optimiser to take a step (hopefully) in the right direction of the global minima.

Let’s run the training loop now for our current hyperparameters:

optimizer = torch.optim.AdamW(model.parameters(), lr = learning_rate)

for iter in range(max_iters):

    # Every once in a while evaluate the loss on train and validation sets

    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val_loss {losses['val']:.4f}")

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad() # Zero gradients from previous step
    loss.backward() # Compute gradients
    optimizer.step() # Adjust parameters according to gradient

## step 0: train loss 5.0668, val_loss 5.0616
## step 500: train loss 1.9482, val_loss 1.9879
## step 1000: train loss 1.7394, val_loss 1.7833
## step 1500: train loss 1.5247, val_loss 1.5592
## step 2000: train loss 1.3954, val_loss 1.4502
## step 2500: train loss 1.3154, val_loss 1.3681
## step 3000: train loss 1.2611, val_loss 1.3102
## step 3500: train loss 1.2073, val_loss 1.2609
## step 4000: train loss 1.1511, val_loss 1.2192
## step 4500: train loss 1.1160, val_loss 1.1914

Look at that loss decline! How is the text generation? Let’s generate \(500\) new characters of Fellowship of the Ring script text³:

context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

## 
##           A Gwalf?
## 
## CLOGLOSE ON: FRODO'S CHAMBITER tocking forell...a smovellives pusit if
## his lard of hand in pearined.
## 
##                      GAND.
## 
## FALACH loomins a boits the couling from. Frodo, evening the Frodo rintle
## liesemshuwn his head sits un Eldo ition wash tof look out.
## 
##           moold vendswevever can
##            olcold a o that poce a stour alEpinto der
##           to you deses your as with
##             Bilbo      pace , one thing thak him dolf the
## 
##                    MELROMAN
##

While many of the ‘words’ are English nor Elvish, the overall structure of the output, including spacing, line breaks, and capitalisation at least resembles the movie script. This is a good starting point!

With all of that madness of development out of the way, we are now going to run some experiments to see how our loss and general output text varies in order to build intuition for what increasing ‘depth’ of the neural network in various ways does for us. It can be tempting to just monitor improvements in loss as it resembles an objective numerical benchmark, however, with generative models—especially with text—I find it helpful to also monitor the quality of generated outputs. If nothing else, it gives you qualitative and semantic information to triangulate with the loss results to build a considered picture of performance, but it can also help you understand how the model is learning with respect to changes in generated words or overall structure. In deep learning, I find this sort of ‘intuition’ invaluable, especially as the models get more and more complex and their exact mechanics become trickier to grasp.

Experiment 1: Modifying the embedding dimension

First up, we are going to modify the embedding dimension to see the impact on our loss value and the generated text output. Recall that so far we have used an embedding dimension of \(192\). What if we say doubled that to \(384\)? Note that this increases computation time, of course. We’ll also functionalise the loss caclculations to make these experiments cleaner by passing in a model name rather than a hardcoded object.

n_embd = 384
model_384 = BigramLanguageGenModel()
m_384 = model_384.to(device)

@torch.no_grad() # Context manager to disable PyTorch's .backward() function to improve memory efficiency
def estimate_loss2(mymodel):
    out = {}
    mymodel.eval() # Set model to evaluation mode
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = mymodel(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # Reset model to training mode
    return out

# Run training loop

optimizer_384 = torch.optim.AdamW(model_384.parameters(), lr = learning_rate)

for iter in range(max_iters):
    if iter % eval_interval == 0:
        losses_384 = estimate_loss2(model_384)
        print(f"step {iter}: train loss {losses_384['train']:.4f}, val_loss {losses_384['val']:.4f}")
    xb_384, yb_384 = get_batch('train')
    logits_384, loss_384 = model_384(xb_384, yb_384)
    optimizer_384.zero_grad()
    loss_384.backward()
    optimizer_384.step()

## step 0: train loss 4.8422, val_loss 4.8359
## step 500: train loss 1.6674, val_loss 1.7186
## step 1000: train loss 1.3247, val_loss 1.3684
## step 1500: train loss 1.1614, val_loss 1.2567
## step 2000: train loss 1.0563, val_loss 1.1944
## step 2500: train loss 0.9671, val_loss 1.1707
## step 3000: train loss 0.8747, val_loss 1.1697
## step 3500: train loss 0.8022, val_loss 1.1867
## step 4000: train loss 0.7183, val_loss 1.2459
## step 4500: train loss 0.6500, val_loss 1.3048

We see a marked improvement in training loss where we are now down to well below \(1\)! Though we did see an increase in validation loss in later iterations. The training loss result is pretty compelling evidence that the doubling of embedding dimensions (and therefore a drastic increase neural network size and complexity) is warranted. Let’s see if the generated text looks any more realistic as a result:

print(decode(m_384.generate(context, max_new_tokens=500)[0].tolist()))

## 
##                                                      (CONTINUED)
## 
##                                                              92.
## CONTINUED:
## 
## 
## FIVENDELL... The towards gass shas shed, Frodo ruffelt! CLOSE ON:
## The RIVEN... The boaders against shirt cription....
## 
##                     GANDALF (CONT'D)
##       Do you rid Bag in Ull geir, moving
##           them Dalke!
## 
##              STRIDER
##               (ELVIS (V.O.) (CONT'D)
##           But they press broken for the with head th.
## 
##                     GAN

Woah! Generated text is now starting to resemble the actual script a lot closer, though we obviously still have a lot of nonsense in there. An improvement, for sure, though.

Experiment 2: Modifying the number of layers

So far we have modified the size of the embedding dimension. Let’s keep that change, but this time, as a final experiment, also double the number of layers from two to four. This is a substantial increase in model size and the training time is reflective of that. Let’s see how we do:

n_layer = 4
model_4_layers = BigramLanguageGenModel()
m_4_layers = model_4_layers.to(device)

# Run training loop

optimizer_4_layers = torch.optim.AdamW(model_4_layers.parameters(), lr = learning_rate)

for iter in range(max_iters):
    if iter % eval_interval == 0:
        losses_4_layers = estimate_loss2(model_4_layers)
        print(f"step {iter}: train loss {losses_4_layers['train']:.4f}, val_loss {losses_4_layers['val']:.4f}")
    xb_4_layers, yb_4_layers = get_batch('train')
    logits_4_layers, loss_4_layers = model_4_layers(xb_4_layers, yb_4_layers)
    optimizer_4_layers.zero_grad()
    loss_4_layers.backward()
    optimizer_4_layers.step()

## step 0: train loss 4.7106, val_loss 4.7204
## step 500: train loss 1.4247, val_loss 1.4720
## step 1000: train loss 1.1388, val_loss 1.2538
## step 1500: train loss 0.9537, val_loss 1.1782
## step 2000: train loss 0.7893, val_loss 1.2175
## step 2500: train loss 0.6085, val_loss 1.3464
## step 3000: train loss 0.4129, val_loss 1.6046
## step 3500: train loss 0.2692, val_loss 1.9314
## step 4000: train loss 0.1865, val_loss 2.2159
## step 4500: train loss 0.1510, val_loss 2.4560

We see another marked improvement in training loss, but a worsening of performance in terms of validation loss as the iterations went on. This suggests overfitting and that either our deep neural network is too deep, or we trained for too many iterations. For example, it appears that if we stopped at around \(3000\) iterations, we might have struck a decent balance between a meaningful improvement in training loss but also not too great of validation loss.

For posterity, here is generation of \(500\) new tokens from the model:

print(decode(m_4_layers.generate(context, max_new_tokens=500)[0].tolist()))

## 
## 
##                                                               100.
## 
## 
## EXT. ORCONTING POLLOW -- NIGHT
## 
## The path turns of BARAD-DUR. THE DARK ORKS and AT xasters across the
## wolden Breen an Elves on mans life in a
## dark tunnels on the mushrooms room racing of again teland in the
## night.
## 
##                     GATEKEEPER
##           Fallowship the creature Gandalf... What
##           did what is my for my op or man
##           sping into the Shire...who that
##           No in that is ben bound to the us of
##

Experiment 3: Reducing the number of layers

Finally, let’s tinker with a three layer model instead of four and see if we don’t overfit as much⁴:

n_layer = 3
model_3_layers = BigramLanguageGenModel()
m_3_layers = model_3_layers.to(device)

# Run training loop

optimizer_3_layers = torch.optim.AdamW(model_3_layers.parameters(), lr = learning_rate)

for iter in range(max_iters):
    if iter % eval_interval == 0:
        losses_3_layers = estimate_loss2(model_3_layers)
        print(f"step {iter}: train loss {losses_3_layers['train']:.4f}, val_loss {losses_3_layers['val']:.4f}")
    xb_3_layers, yb_3_layers = get_batch('train')
    logits_3_layers, loss_3_layers = model_3_layers(xb_3_layers, yb_3_layers)
    optimizer_3_layers.zero_grad()
    loss_3_layers.backward()
    optimizer_3_layers.step()

## step 0: train loss 4.5071, val_loss 4.5042
## step 500: train loss 1.5091, val_loss 1.5573
## step 1000: train loss 1.2208, val_loss 1.2885
## step 1500: train loss 1.0517, val_loss 1.1953
## step 2000: train loss 0.9020, val_loss 1.1587
## step 2500: train loss 0.7637, val_loss 1.1968
## step 3000: train loss 0.6106, val_loss 1.3210
## step 3500: train loss 0.4594, val_loss 1.5273
## step 4000: train loss 0.3245, val_loss 1.7633
## step 4500: train loss 0.2297, val_loss 2.0562

Fascinating! We clearly have a case for a stopping rule based on validation loss or iterations. I believe we could keep the current neural network depth as it clearly seems to be learning the text structure reasonably well, but we definitely need some guardrails in place to protect against overfitting.

Finally, here is some generated text from the model as a parting detail:

print(decode(m_3_layers.generate(context, max_new_tokens=500)[0].tolist()))

## 
## 
## 
## 
##                                                         (CONTINUED)
## 
##                                                          99.
## CONTINUED:
## 
## 
##                     SAM
##           Did you hear for the hewere cing
##           frendon's craful.
## 
##                     BILBO
##           If you're referring, Sammos!
## 
##                     ARWEN
##           And the lord, Frodo.
## 
##                                                         50.
## 
## 
## 
## EXT. CARAS GALADHON LAWN -- NIGHT
## 
## Wide on: HOBBITON...shrouded of the

Parting thoughts

Yikes that was a lot! This is undoubtedly my longest blog post yet. Despite all the craziness, I hope at least some of the ideas and takeaway messages were interesting and clear!

Note that this can be thought of as a basic kind of autoregressive process.↩︎
NOTE: We say ‘maximum’ because it is common in many settings to define certain stopping rules if certain performance thresholds are reached or if no improvement in loss is achieved over a certain number of iterations.↩︎
Bear in mind that a bigram language model is NOT going to produce stellar language in this context. The important thing is that it gets the broader structure right, such that on first glance it appears to be generating text similar to the script.↩︎
NOTE: There are absolutely cleaner, more functional ways to iterate model architectures in Python than this, but this way makes it clear for the purpose of a blog post.↩︎

Share on

Twitter Facebook LinkedIn

Trent Henderson