“The most atomic way to train and inference a GPT in pure, dependency-free Python. Everything else is just efficiency.” — Andrej Karpathy
In this article, we will break down Andrej Karpathy’s legendary 243-line MicroGPT into simple, understandable sections.
No PyTorch. No GPUs. No deep learning libraries.
Just mathematics, Python, and probability.
Let’s go step by step.
import os
import random

if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')

docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)
Example names:
emma
liam
olivia
noah
This GPT does character-level learning.
uchars = sorted(set(''.join(docs)))   # every unique character in the data
BOS = len(uchars)                     # a special beginning-of-sequence token
vocab_size = len(uchars) + 1
Example:
a → 0
b → 1
c → 2
...
This is our vocabulary.
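To make the mapping concrete, here is a small sketch of how characters could be encoded and decoded with the uchars list defined above; the stoi/itos helper names are mine, not from Karpathy's file.

```python
stoi = {ch: i for i, ch in enumerate(uchars)}     # character -> integer id
itos = {i: ch for ch, i in stoi.items()}          # integer id -> character

tokens = [stoi[ch] for ch in "emma"]              # encode a name
print(tokens, ''.join(itos[t] for t in tokens))   # decode it back to "emma"
```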
Unlike big GPTs, this model has no BPE tokenizer: its vocabulary is just the individual characters plus a single BOS marker.
The Value class is the heart of this file.
class Value:
This implements reverse-mode automatic differentiation (autograd): every arithmetic operation records how to pass gradients back through it.
Every number in the model is wrapped inside a Value.
Example:
a = Value(2.0)
b = Value(3.0)
c = a * b
Now:
c.data = 6.0
When we call:
loss.backward()
the chain rule flows backward through every operation, filling in a gradient for every Value involved.
This is a mini PyTorch, written from scratch.
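To see what that means in code, here is a heavily stripped-down sketch of such an autograd scalar, supporting only + and *. Karpathy's actual Value class covers more operations, but the chain-rule machinery is the same idea.

```python
class Value:
    """A tiny autograd scalar: stores data, grad, and how to backpropagate."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children         # Values this one was computed from
        self._local_grads = local_grads   # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            for child, local in zip(node._children, node._local_grads):
                child.grad += local * node.grad

a = Value(2.0)
b = Value(3.0)
c = a * b
c.backward()
print(c.data, a.grad, b.grad)   # 6.0 3.0 2.0
```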
n_embd = 16
n_head = 4
n_layer = 1
block_size = 16
We define:
| Parameter | Meaning |
|---|---|
| n_embd | Embedding size |
| n_head | Attention heads |
| n_layer | Transformer layers |
| block_size | Max sequence length |
state_dict['wte'] # token embeddings
state_dict['wpe'] # positional embeddings
state_dict['lm_head'] # output layer
And for each Transformer layer there are attention weights (query, key, value, and output projections) and MLP weights.
All initialized randomly:
random.gauss(0, std)
At this stage, the model knows nothing.
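As a sketch of what that initialization could look like in list-of-lists form; the matrix helper, the 0.02 standard deviation, and the exact shapes are my assumptions, not taken from the file.

```python
import random

def matrix(rows, cols, std=0.02):
    """rows x cols of Value scalars drawn from a Gaussian N(0, std)."""
    return [[Value(random.gauss(0, std)) for _ in range(cols)] for _ in range(rows)]

# Hypothetical layout mirroring the names used above:
state_dict = {
    'wte': matrix(vocab_size, n_embd),      # one embedding row per token id
    'wpe': matrix(block_size, n_embd),      # one embedding row per position
    'lm_head': matrix(vocab_size, n_embd),  # output projection back to the vocabulary
}
```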
def gpt(token_id, pos_id, keys, values):
This is the Transformer forward pass for one token; keys and values act as a cache of the tokens seen so far.
It follows a GPT-2-style architecture:
x = token_embedding + position_embedding
This gives each token both meaning (which character it is) and position awareness (where it sits in the sequence).
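In list-of-lists form, that sum might look like this (assuming the state_dict sketch above):

```python
token_id, pos_id = BOS, 0   # e.g. the start of a sequence at position 0
tok_emb = state_dict['wte'][token_id]
pos_emb = state_dict['wpe'][pos_id]
x = [tok_emb[i] + pos_emb[i] for i in range(n_embd)]   # elementwise sum of the two rows
```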
Instead of LayerNorm, Karpathy uses RMSNorm:
x = rmsnorm(x)
This stabilizes training.
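Here is a minimal RMSNorm sketch using plain floats; the real file performs the same arithmetic on Value objects, and the epsilon value here is an assumption.

```python
def rmsnorm(x, eps=1e-5):
    """Scale x so its root-mean-square is ~1 (no mean subtraction, unlike LayerNorm)."""
    ms = sum(xi * xi for xi in x) / len(x)   # mean of the squared elements
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

print(rmsnorm([1.0, 2.0, 3.0]))
```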
For each head:
attn_logits = q ⋅ k / sqrt(head_dim)
attn_weights = softmax(attn_logits)
head_out = weighted sum of values
This allows each token to look back at every earlier token in the sequence and pull in the most relevant context.
Even at character level!
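Here is a single-head causal attention sketch in plain Python floats; the helper names are mine, but the score / softmax / weighted-sum steps mirror the formulas above.

```python
import math

def softmax(logits):
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def attend(q, keys, values):
    """One head: score the current query against every cached key,
    then blend the cached values with the softmax weights."""
    head_dim = len(q)
    attn_logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim) for k in keys]
    attn_weights = softmax(attn_logits)
    return [sum(w * v[i] for w, v in zip(attn_weights, values)) for i in range(head_dim)]

# Toy usage: a query attends over two cached (key, value) pairs.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(attend(q, keys, values))
```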
x = linear → ReLU → linear
This lets the model apply a non-linear transformation to what attention gathered, mixing features at each position.
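A plain-float sketch of that two-layer MLP, where W1 and W2 stand in for the layer's weight matrices (names and sizes are mine):

```python
def linear(x, W):
    """Matrix-vector product: one output value per row of W."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def mlp(x, W1, W2):
    hidden = [max(0.0, h) for h in linear(x, W1)]   # ReLU keeps only positive activations
    return linear(hidden, W2)

# Toy usage: 2 inputs -> 3 hidden units -> 2 outputs
print(mlp([1.0, -1.0],
          W1=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
          W2=[[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]))
```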
logits = linear(x, lm_head)
This produces one raw score (a logit) for every entry in the vocabulary.
Not probabilities yet — just logits.
Each name becomes:
[BOS] a n n a [BOS]
At each position, the model must predict the next character.
Loss:
loss_t = -log(prob[target])
This is cross-entropy loss.
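In code, the per-position loss could be computed like this (a plain-float sketch, not the file's exact implementation):

```python
import math

def cross_entropy(logits, target):
    """-log of the softmax probability assigned to the correct next character."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[target])

print(cross_entropy([2.0, 0.5, -1.0], target=0))   # small loss: the model favored class 0
```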
The model learns by minimizing:
the average negative log likelihood over all positions.
loss.backward()
This computes gradients for every parameter using the chain rule.
This is pure math.
m[i] = beta1 * m[i] + (1 - beta1) * p.grad
v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
Adam keeps two running statistics per parameter: a momentum of the gradient (m) and a running average of the squared gradient (v).
Parameters update:
p.data -= learning_rate * adjusted_gradient
Training runs for 1000 steps.
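Here is a self-contained sketch of one Adam update for a single scalar parameter; bias correction is omitted for brevity, and the hyperparameter values are common defaults, not necessarily the ones in the file.

```python
beta1, beta2, eps, learning_rate = 0.9, 0.999, 1e-8, 1e-2

p_data, p_grad = 0.5, 0.2     # a toy parameter and its gradient from loss.backward()
m_i, v_i = 0.0, 0.0           # per-parameter moment buffers, start at zero

m_i = beta1 * m_i + (1 - beta1) * p_grad          # momentum of the gradient
v_i = beta2 * v_i + (1 - beta2) * p_grad ** 2     # running average of the squared gradient
adjusted_gradient = m_i / (v_i ** 0.5 + eps)      # per-parameter scaled step
p_data -= learning_rate * adjusted_gradient
print(p_data)
```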
temperature = 0.5
We start with BOS and repeatedly run the model, scale the logits by the temperature, convert them to probabilities, and sample the next character.
If BOS is generated → stop.
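A minimal sketch of that sampling step, assuming we already have the logits for the next character:

```python
import math, random

def sample_next(logits, temperature=0.5):
    """Scale logits by temperature, softmax them, and sample one token id."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    probs = [e / sum(exps) for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# Lower temperature sharpens the distribution toward the largest logit.
print(sample_next([2.0, 1.0, 0.1], temperature=0.5))
```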
Example output:
anita
kane
torian
kalana
Many of these names never appear anywhere in the training data.
This is generative behavior.
1️⃣ GPT does not store words
2️⃣ It learns probability distributions
3️⃣ Creativity emerges from sampling
4️⃣ Temperature controls randomness
5️⃣ Transformers are just math + matrices
Because every step is written out by hand, with no library hiding the math, this is transformer architecture in its most atomic form.
Large GPTs have billions of parameters.
This one has only thousands.
Yet the core idea is identical:
Attention + MLP + Backpropagation + Probability Sampling
Everything else is scale and efficiency.
Thanks to Andrej Karpathy, we can witness generative intelligence emerge from 243 lines of pure Python.